From YouTube: Scalability Team Demo - 2021-09-22
A: Okay, recording. So what I wanted to talk about: when we introduce the new SLIs, we'll allow people to set their own target durations for requests. We don't want what we have now, where everything that takes less than five seconds is okay. There needs to be a good reason to set a five-second threshold on an endpoint, and I'm trying to work out what kinds of endpoints those are.
A: So far, the one criterion I've come up with is simple: there needs to be a reason to do it. The endpoint needs to sometimes be slow, we need to be able to explain why that is, and there needs to be no real thing we can do about it. One example, in my mind, is GraphQL.
A: Yeah, because people fill the request with whatever they want. We will probably define other things differently, like the queries that we built for our own frontend; those need to meet different standards. I saw Paul was working on ways of identifying those with the operation name and that kind of thing, and we'll probably set different targets there.
B: Yeah, I think so, because there are just a lot of questions around that which don't really apply to any other endpoint, so let's ignore that one. Okay: for increasing a duration threshold, there should be solid reasoning as to why a duration should be higher than one second.
A: Yeah. For example, the search controller endpoints that are using Elasticsearch or something: are those okay? We know that those don't spend time on the CPU, so they're not really hindering other threads in the same process from doing work, and we're probably fine with Elasticsearch.
B: On the database side, we have a connection pool with a limited number of connections available, and we don't want to saturate that pool because one process is hogging a connection for a long time, because it's taking a long time on a particular query. I realize that practically the answer is no there; I'm just saying, hypothetically, it sounds like that would apply equally to the database, and to Gitaly and Redis for that matter, whether or not it would be spending over five seconds anyway. Yeah, exactly.
B: This search, then: those counts loaded asynchronously, but I was able to, you know, go and click on this, and then they'll probably load asynchronously again, or maybe they're cached.
B: When I change tabs, you can sort of see them loading there. Obviously in this case they're fast, but if they were over five seconds, I think that's still bad, right? Yeah, over five seconds.
A: That's the first group that asked for it, but one of the questions that came up is: you've set the target for everybody to be one second, but there's no way to change that right now. So we're building a way to change it, but we want to keep that under control. So I don't know which endpoints will want it.
A: There's also the problematic thing of the SLO that doesn't apply to their budgets, where everybody's only looking at their budgets and not at the service availability stuff.
A: It's very easy for us to say we're going to tighten up the threshold on that Git request endpoint because it needs to be fast, say Git authentication, since people don't want to wait for that, and it's already fast, it's already meeting its SLO. So we just set that, and that's it.
B: So yeah, I think I would be tempted to say we don't come up with a set of rules now; we just take these on a case-by-case basis. For instance, for the search controller counts, we just say: okay, we're going to do that, we're going to try it out. We might move it back, we might tweak what the slow threshold is, but we're going to do that for now.
B: Let's see what happens, and then if we get more requests we can just say whether we approve them or not. Once we've got a few, we can ask: is there something in common between these? Do we think all of these are valid? I think it's hard to come up with a rule when we don't have examples.
C: Hello! Good morning, or good evening for you.
B: I mean, yeah, my inclination would be to lazily evaluate it, basically: okay, you stage groups, go fix this, or make a case to us as to why this should have a five-second threshold. For some of these, the top few, a five-second threshold wouldn't actually help. But okay: you make the case for us why they should have a five-second threshold. Once we've got a few of those, then we say: okay.
B: This is the group we're using for deciding; otherwise we just have to decide whether each case makes sense. I think we might be going too early with the idea of coming up with a rule for this, when we don't know what people are going to ask for, and we don't know why, because we can make a rule based on what we know now and someone could make a perfectly good case for something that doesn't match that rule.
B: Yeah, so, you know, why don't we just let them do that? Let them make the case in the first place, until we have a way of making it a rule. Hopefully we don't need a way of making it a rule at all, because we only need to do a few.
B: Yeah, I mean, I think maybe for some of them that might not be the case. I'm just trying to have a look there and see if that's a good example; this query is linked from... Maybe the POST to artifacts might be a good example: that's not an interactive command anyway, it's run by CI, and it will have higher tail latencies because it's pushing a lot of binary data. I guess that would be a potential case for that one. Yeah, that's one of the things I mentioned as well.
C: We're not in control of the completion duration. Generally, that's what we've been treating as exceptional conditions, which seems related to the problem.
A: Yeah, that's excellent stuff, and in the end that's the whole reason why we want to push the threshold into the application: to allow people from product to say, no, nobody's waiting for this, this can be slow. And as you mentioned, Matt, this is a good example, and it's an upload to somewhere else, so it's I/O, which means that sometimes the lock will be released anyway, because we're waiting.
B: Yeah, things like the blob controller are interesting to me, because we also see, you know, that will be Gitaly, but it will also be highlighting, and you can make the case a couple of different ways there. But I think...
B: Sorry, Matt, you're very quiet, I'm not sure why, but we can't hear you.
C: Okay, thanks! So, yes... I forget what I was going to say. I'm sorry, it'll come back to me. Sorry, go ahead.
A: One more thing that I wanted to bring up again, because I think we will need to come up with an answer for it: right now it's very easy to explain error budgets, because it's just success rate versus failure rate, and that's easy to explain. It gets harder once we turn that into the way we do the general SLA dashboards, with minutes where things are good and minutes where things are bad, and then taking averages of that, and so on.
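[Editor's note: a minimal sketch of the two accounting styles being contrasted here, a request-based error budget versus the minutes-based availability used on the general SLA dashboards. The symbols are illustrative, not the team's exact definitions:]

$$\text{budget spent} = \frac{\text{failed requests}}{\text{total requests}}, \qquad \text{availability} = \frac{\text{minutes where the SLO was met}}{\text{total minutes in the window}}$$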
C: Yeah, that could be. So, I remembered what I wanted to say earlier: sometimes there can be interaction effects between requests, and so an individual request isn't necessarily responsible for its own slowness. One of the ways to identify that is to look at the distribution; not exclusively the long-tail distribution like what we have here, but also looking at, like, the...
B: Yeah, I think that's a fair point, and I think that's also a good reason to be lazy about this and let the stage groups make the case for the threshold, because they can say: well, we've looked into this endpoint, and actually it's slow because of other threads in the same process, or whatever.
C: You know, in addition to this (this is a fantastic display), I think maybe, alongside the count column, we could also have a count of requests that exceeded whatever threshold we're talking about, like the five seconds we're given.
C: Oh, and change the label, just for clarity.
B: Yeah, it's something, because it's a very subtle thing to spot here. There you go; that wasn't too bad, actually. I don't think there's an easy way to do slow count divided by count without taking us out to another program, but we can sort of eyeball it. So projects is, what, one in four to five that are over the threshold, whereas events has a very low portion over the threshold.
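[Editor's note: the ratio being eyeballed is just the share of requests over the target duration; with the figure mentioned here, roughly one in four to five for the projects endpoint:]

$$\text{slow ratio} = \frac{\text{slow count}}{\text{total count}}, \qquad \text{projects} \approx \frac{1}{4.5} \approx 22\%$$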
B: Yeah, in fact that seems like, what's that, a twentieth? No, a fiftieth. So yeah, if we picked a slightly lower percentile, we wouldn't even have seen that one here. In terms of... okay, yeah, the slightly lower percentile is here already. Sorry, I mean: because a fiftieth of requests are over one second, if we took the 98th percentile, that would probably be pretty much bang on one second.
A: Yeah, but that's not an SLO, like... No, no.
B: A tail: that's a case where the tail latency is high, but the median latency, and again you can see the median latency, is pretty good. The projects one seems more like it would be, not exactly a normal distribution, but something with a, you know, chunkier tail.
C: That project events one, the top one there, clearly doesn't represent a large... If you were scoring this in terms of how many requests caused a user to be unsatisfied, like the traditional Apdex definition, then this wouldn't be a high-value target, because there are only 19,000 requests out of clearly over a million requests summed across all these endpoints. But that outlier is outrageous: 15 seconds.
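[Editor's note: for reference, the traditional Apdex score mentioned here, with a target duration $T$; requests under $T$ count as satisfied and those under $4T$ as tolerating:]

$$\text{Apdex}_T = \frac{\text{satisfied} + \tfrac{1}{2}\,\text{tolerating}}{\text{total requests}}$$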
B: It times out very often, but yeah, which is why I was wondering about the offset pagination as well. And this also, this also...
B: ...second, it doesn't matter, it's fine. So, okay, what else have we got here? Again, the Pages API is even more extreme than that, because it's got 20,000 out of 4 million, and its 99.5th percentile is in fact below one second anyway.
B: Yeah, anyway, that's why I would like to push this onto the stage teams at first, so they can look at each endpoint they own and ask: okay, is this... And what would trigger you to say no? Well, for something like the events one, I would be tempted to say no, because it is actually pretty close to meeting this SLI; if you look at the distribution, it's pretty close. It's not!
A: But wait, no: this is the SLI for the SLO for API, which this is handling, and the duration there. The 99.5th percentile is within two seconds, so that would still require an increase from the one-second default.
B: To five? No, no, all I'm saying is that I think it's easier for the development team to get a handle on the events API, because, what is that? I can't do math in my head.
B: Two percent of requests are exceeding it, as opposed to the API v4 projects endpoint, the top row, where 25% of requests are exceeding it. I don't think that's a theoretical argument; I think that's just a practical argument. If you have 25 percent of requests... And we know that the projects API supports a bunch of different options, same with the group projects one, where, you know, you can sort it this way, you can include these fields.
B: That's very different from the project events one, where it's already at two percent. Pages is even closer, because that's, what, twenty thousand out of four million? Oh well, actually, its 99.95 is fine; it's only at 99.8 that it gets bad. Those are the ones I would be tempted to say we won't increase the threshold for, because you're already... yeah. So this one, this one is interesting.
B: Yeah, but I think as well there's a case that that's more of an interactive request, even though it's an API request, because this is made by GitLab Pages when you visit a GitLab Pages site, and since they're static pages, they shouldn't be taking five seconds. So there's also, yeah, this...
B
Basis
for
that
one,
but,
like
you
know
this,
is
this
because
of
the
usage
pattern
and
like
maybe
the
dispatch
one
is
kind
of
in
between,
because
I
think
that's
triggered
when
you
scroll
on
the
diffs
page
on
the
merge
requests.
So
you
know
maybe
there's
a
little
bit
more
tolerance
than
for
the
first
page
load,
but
it's
still
interactive
but
yeah.
B: My temptation is still to just start by saying, you know: have the stage group make a case. We can say what exceptions we've added before and why, but also look at... and then also make a call, like...
B: ...a histogram they can plug their caller ID into and show what the distribution is. I mean, the Kibana histograms are very good, though. But if they can show what the distribution of response times is as well; because you can see from the summary statistics we have here, again: projects, 25% are exceeding that threshold.
B: You know, the pages one, it's one percent exceeding, or just over one percent. Sorry, not pages, that's tags; pages is about half a percent exceeding the threshold. Those are going to be very different histograms, and there will be very different paths to getting them to meet that threshold. If the stage group can't do that, okay. The other thing, and this might be going off on a bit of a tangent, from what I remember...
B: Yeah, exactly. If you have an endpoint that has a million requests per day and one that has 10,000 requests per day (I know that's a bad example, those aren't quite the right numbers, but you get the idea), if you have one that completely dominates your request count and is always fast, then it almost doesn't matter what your other endpoints do, because when you aggregate up by stage group, you will probably still meet the threshold.
A: Because most of your things are fast. Likewise, if you have a thing that has a million requests, and just because it has so many requests it has 10,000 slow ones, that's still pretty good. But if you have one below that with 5,000 slow ones out of only 10,000...
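[Editor's note: a worked version of the aggregation effect being described, using the round numbers from the discussion:]

$$\frac{10{,}000}{1{,}000{,}000} = 1\%\ \text{slow}, \qquad \frac{5{,}000}{10{,}000} = 50\%\ \text{slow}, \qquad \text{combined: } \frac{15{,}000}{1{,}010{,}000} \approx 1.5\%\ \text{slow}$$

Rolled up per stage group, the dominant fast endpoint hides the badly behaved one.]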
C: I mean, do you think it's really important to give some guidance about how to prioritize these? Because if we present a case where you need to have a certain percentage of endpoints per stage group that meet the target response time, that gives a huge incentive to prioritize whatever's easiest to improve, rather than improving the end-user experience.
C
Goal
of
the
of
of
scoring
based
on
aptx
is
to
improve
the
end
user
experience
so
so
prioritizing.
For
example,
a
high
high
throughput
api
endpoints,
where
end
users
aren't
actually
waiting
for
the
response
will
improve
the
stage
group's
score
without
actually
improving
user
experience.
So
I
think
we
should
explicitly
state
that
the
goal
here
is
to
improve.
You
know
things
that
end
users
will
perceive
as
being
fast
yeah.
C
Yeah
that
that
example
for
the
gitlab
page's
api
call
is,
I
think,
fantastic
because,
like
you
know,
kind
of
the
the
initial
gut
reaction
is
going
to
be
well.
Api
calls
are
generally,
you
know
not
a
thing
that
humans
are
waiting
on,
but
in
this
case
it
certainly
is
I
mean
it's.
It
makes
a
good
case
study
is
all
I'm
seeing.
B: No, for sure. I think the examples we discussed there are probably reasonable cases to start with. Like, you know, the artifacts POST is probably not that important to end-user experience. I'm going...
A: ...to bring up the three examples. I think I'm going to bring them up in the issue for now, though I don't know how to translate that into documentation. The three examples that I take away are these. There's the internal Pages one, which should probably have a tighter threshold, because people are actually waiting for it, and it should not get a looser threshold.
A: Yeah, yeah. Then there's the projects endpoint, which has a million ways of calling it, and some of them are just more work to do and therefore slower, and there's not a lot we can do about that. And then there's the artifacts upload, which is moving data around: it's going to be slow, and there's also not a lot we can do about it, but it's also not very bad, because it's not occupying the worker the whole time while it's doing that, I think.
A: Well, we have three cases there, if we can correlate requests that we get from stage groups to one of them. There was some stuff that I worked on with the people from the registry as well, where the requests were slow, but they were sending three gigabytes' worth of data across the wire; that's not going to be done in a second.
B: Let me just quickly share this stuff about the rate-limiting Redis instance. We're doing another functional partition of Redis. We currently store rate-limiting data for the two different ways we do rate limiting. One is the third-party Rack::Attack library, which operates at the request level, before the request is even passed to Rails. And then there is what we call application rate limiting, which happens inside a request: we can say, if you're doing something expensive, like an export...
B
Each
user
can
only
do
one
export
per
period,
and
so
we
have
to
two
things:
there
we're
going
to
use
this
new
redis,
both
of
them
like
the
cache
it
can
be
configured
as
an
re
liu,
because,
it's
rate,
let
me
see
information,
it's
generally
very
short-lived.
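[Editor's note: a minimal sketch of the fixed-window style of application rate limit being described, one expensive action per user per period, using the go-redis client. The key names and limits are illustrative, not GitLab's actual implementation; the point is that every key expires with its window, which is why this data is short-lived and safe on an LRU-evicted instance:]

```go
package ratelimit

import (
	"context"
	"time"

	"github.com/redis/go-redis/v9"
)

// Allow increments a per-user counter in Redis and reports whether the
// action is still within the limit for the current window.
func Allow(ctx context.Context, rdb *redis.Client, userID string, limit int64, window time.Duration) (bool, error) {
	key := "ratelimit:export:" + userID

	count, err := rdb.Incr(ctx, key).Result()
	if err != nil {
		return false, err
	}
	if count == 1 {
		// First hit in this window: start the clock. The key disappears
		// when the window ends, so nothing here is long-lived.
		if err := rdb.Expire(ctx, key, window).Err(); err != nil {
			return false, err
		}
	}
	return count <= limit, nil
}
```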
B: Most of what I've been doing on this so far is adding the ability for the application to configure that, which has mostly been about learning how this already works. So we add a class; yeah, we add a class. I'm just going to share this. This is the current state of this file, the Redis configuration doc: it's docs, but it's not actually on a docs site.
B
This
is
mainly
used,
I
think,
for
source
installs,
I'm
not
100
sure
what
the
audience
is,
but
we
have
very
detailed
documentation
on
each
redis
instance
and
there's
some
slightly
confusing
parts
like
redis
cache
default
support,
6380
on
localhost,
redis
queues,
defaults
to
6381
and
credit
shared
state
defaults
to
6382.
So
this
is
only
if
you
have
no
configuration
files
at
all,
so
this
is
apparently
supporting
some
very
legacy
use
case.
What
I've
tried
to
do
here
is
simplify
this.
B
So
when
we
add
a
new
one,
we
don't
have
to
copy
and
paste
all
of
that
stuff
and
we
can
hopefully
avoid
making
the
many
many
copy
and
paste
errors
that
I
made
when
I
was
working
on
this,
so
we
just
say,
like
you
know,
you
have
an
instance.
It's
called
this.
It's
going
to
look
in
this
environment
variable
based
on
this
name
or
it's
going
to
look
in
this
file
or
if
it's
available,
it's
going
to
look
in
this
file.
B: If there's no URL, it's going to use localhost and this port; that's the basic idea. These are hard-coded in the application, and previously the trace chunks instance didn't have a fallback, sorry, a default, at all. So if there was no Redis configuration, the trace chunks Redis instance wouldn't work, which I think is fine, because that's optional. I have no idea if that is a case we need to support or not.
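[Editor's note: a sketch of the resolution order being described: a named environment variable, then an instance-specific config file, then a shared config file, then a hard-coded localhost default. The naming scheme, file paths, and helper here are hypothetical, for illustration only:]

```go
package redisconfig

import (
	"fmt"
	"os"
	"strings"
)

// Instance describes one functional Redis partition, e.g. "cache",
// "queues", "shared_state", "trace_chunks", or "rate_limiting".
type Instance struct {
	Name        string
	DefaultPort int // used only when nothing at all is configured
}

// URL resolves the connection URL in the order described above.
func (i Instance) URL() string {
	// 1. Environment variable derived from the instance name
	//    (hypothetical naming scheme).
	if url := os.Getenv("GITLAB_REDIS_" + strings.ToUpper(i.Name) + "_URL"); url != "" {
		return url
	}
	// 2. Instance-specific config file, e.g. config/redis.cache.yml.
	if url := readURLFromFile(fmt.Sprintf("config/redis.%s.yml", i.Name)); url != "" {
		return url
	}
	// 3. Shared config file covering all instances.
	if url := readURLFromFile("config/resque.yml"); url != "" {
		return url
	}
	// 4. Legacy fallback: localhost with a hard-coded per-instance port.
	return fmt.Sprintf("redis://localhost:%d", i.DefaultPort)
}

// readURLFromFile is a stand-in: a real implementation would parse the
// URL out of the YAML file at path, returning "" when it is absent.
func readURLFromFile(path string) string {
	if _, err := os.Stat(path); err != nil {
		return "" // file missing: fall through to the next source
	}
	// YAML parsing omitted in this sketch.
	return ""
}
```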
B: But if it is a case we need to support for self-managed instances, then we do need that default for the rate-limiting one, because it's much more likely that rate limiting is turned on than trace chunks: trace chunks needs a feature flag and rate limiting doesn't. And when we move sessions, that'll be even more important, because literally every GitLab instance uses sessions, so there's no way we can get away without it. So that's the basic idea.
B: With the port, I mean, it's hard to even notice that it's a thing: you need to literally have no configuration files whatsoever, which is unusual. I don't even know how you would have Redis running on those ports in that case, but apparently it's a thing. The other thing I noticed, and this is just a weird thing, but I figured it was worth cleaning up...
B: ...in the GDK, the GitLab Development Kit, we set different Redis databases for development and test for each instance, so they act as isolated instances. That's helpful because, (a), you could potentially have a key clash, although you shouldn't; but also, (b), before, when we just used the same database on the same instance, you'd be logged out every...
B: ...test run, yeah, development and test. If you run the tests, they clear Redis: that logs you out, that deletes all your Sidekiq queues, that does everything. And the database numbers in these configuration files, which again I'm not sure how useful they are, were different from the ones we use for the GDK. But I figured, as they're example files, I can just make those match up. So I'm basically just trying to make that a little bit easier in the future.
B: Adding a new instance isn't hard, it's just tedious, because there are a bunch of little paper cuts like this that will trip you up, and it's mostly pretty rote. But as we're going to be asking the memory team to add the one for sessions, it seemed like a good time to take a bit of time to...
A: ...do that. Jacob already did a lot of work when he added the trace chunks one, to make it a little bit more manageable code-wise to do that. But apparently he missed the port bit.
B
Yeah
well
exactly
so.
This
is
the
thing
like
yeah.
Basically,
I
submitted
this
at
marcia,
but
I
was
like
this
looks
pretty
good
to
me
and
yeah.
I
was
like
well
you've
copy
and
pasted
the
wrong
thing
here.
You
copy
and
paste
the
wrong
thing
here.
This
port,
you
know,
comes
from
here
and
I
was
like
okay
right.
Let
me
go
fix
that
and
then
he
spotted
a
bunch
more
things
and
then
I
was
like
okay.
Let
me
go
fix
that
and
he
was
off
today.
So
I
was
like
okay.
B
Let
me
let
me
take
a
step
back
here.
Instead
of
playing
with
them
all
and
see
what
see
what
I
could
actually
learn,
because
this
is
clearly
not
as
straightforward
as
I
thought,
and
it's
not
going
to
be
straightforward
for
the
next
person
to
do
it.
So
that's
where
I
am
with
that
and
then,
while
I
also
have
that
thing
that
the
future
has
been
reasons,
so
people
started
to
figure
out
where.
B: Just for context: Sidekiq has a client middleware and a server middleware. Client middleware runs when you schedule a job, and server middleware runs when the Sidekiq worker process picks up the job. Previously I had added something to the client middleware for feature categories, so that we would get the correct feature category in the logs, but then that didn't work for metrics.
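[Editor's note: Sidekiq's middleware is Ruby, but the mechanism being described, stamping metadata on the job at enqueue time and reading it back when the worker picks the job up, looks roughly like this language-neutral sketch in Go. All names here are illustrative, not Sidekiq's API:]

```go
package jobqueue

// Job is the payload that travels through the queue.
type Job struct {
	Class    string
	Args     []any
	Metadata map[string]string
}

// ClientMiddleware runs in the web process when a job is scheduled.
// It stamps the feature category so logs and metrics downstream can
// attribute the work to the right stage group.
func ClientMiddleware(job *Job, callerFeatureCategory string, next func(*Job)) {
	if job.Metadata == nil {
		job.Metadata = map[string]string{}
	}
	job.Metadata["feature_category"] = callerFeatureCategory
	next(job)
}

// ServerMiddleware runs in the worker process when the job is picked
// up. It reads the stamped category and would expose it to logging and
// metrics for the duration of the job.
func ServerMiddleware(job *Job, next func(*Job)) {
	category := job.Metadata["feature_category"]
	_ = category // e.g. attach to a scoped logger and a metrics label
	next(job)
}
```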
B: So now I'm going back to: do I need to have this in two places? But I deliberately deleted the client middleware part because it was causing problems with the metrics, so I'm having difficulty getting the logs and the metrics to line up, and Bob's going to have to review that when I get to it. At the moment I would say the metrics are right and the logs are not what I want, which is the better way around, because the logs are more transient. And also the met...
B: The logs are wrong in a way that makes them higher cardinality, which is okay for logs but wouldn't be okay for metrics. When I say wrong, I mean there are two reasonable positions you could choose; it just happens that they don't line up. So yeah, I need to fix that, but I don't really have anything to show over there, because I haven't figured out how to do it.
C
For
the
for
the
the
transitional
state,
when
we
begin
to
roll
out
the
oh,
my
gosh,
I'm
blanking
out
the
topic
we
were
just
talking
about.
C: Right now, what's its name... yes, the rate-limiting things, I'm assuming. So I wanted to check my assumptions here. I'm assuming that we're not going to make any attempt to transfer state information from the current rate-limiting state when we roll this out; we'll just start with a clean slate, so everyone's, you know, consumption rate goes back to zero. I think.
B: Yeah, so really, if you're using a feature flag for that, you'd really be struggling to notice the difference either way. I think the longest interval we have for an application rate limit is on the order of single-digit minutes.
B: With those, I think the current plan, which I need to discuss with Craig because I haven't gotten around to writing it up, is that we won't attempt to transfer data or do multi-writes or multi-reads, but we will feature-flag them, so that we don't switch them both over at the same time. So we can switch just Rack::Attack, or just the application rate limiting.
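[Editor's note: a sketch of the cutover shape being described: no data migration, just an independent flag per subsystem choosing which Redis pool to use. The flag wiring and names are hypothetical:]

```go
package cutover

import "github.com/redis/go-redis/v9"

// Pools holds the old shared instance and the new dedicated
// rate-limiting instance.
type Pools struct {
	SharedState *redis.Client // current home of rate-limiting data
	RateLimit   *redis.Client // new functional partition
}

// The two selectors are flipped independently, so Rack::Attack and
// application rate limiting can be cut over one at a time. Counters
// simply restart from zero on the new instance, which is acceptable
// because the longest window is single-digit minutes.
func (p Pools) ForRackAttack(useNewInstance bool) *redis.Client {
	if useNewInstance {
		return p.RateLimit
	}
	return p.SharedState
}

func (p Pools) ForApplicationLimits(useNewInstance bool) *redis.Client {
	if useNewInstance {
		return p.RateLimit
	}
	return p.SharedState
}
```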
B: Okay, but, like you said, it's very simple to reason about. Yeah, yeah, as opposed...
B: ...to trying to cut that data over, which we will need to do for sessions. Yes, and that will probably need to be a multi-read, multi-write; yes, and also potentially creating the sessions instance based on a secondary, the backup, from... well, I don't even know, I haven't thought about it that far.
C: Yes, exactly, because we're not under attack most of the time. We're not under an abusive workload most of the time, and we're not going to roll this out while we're under a massive attack. So, you know, we're giving someone a free pass for a few minutes to, you know, double the number of requests they're making to some endpoint that would otherwise be limited, and that seems okay.
C: Yeah, that's a great question, though. I guess, because it seems like a complex scenario that would be easy to get wrong, I would probably still lean towards not bothering with the migration this time, because we can get away with it. I mean, that's if we were talking about a rate per hour; if it was a rate per day, then I'd be a little bit more hesitant about skipping that step.
B
I
would
get
back
to
matt's
point
from
earlier
about
like
we're
not
under
attack
most
of
the
time.
We
would
do
it
when
we're
under
attack.
So
we'd,
probably
just
do
that
at
quiet
periods
like
not
even
a
quiet
compared
to
under
attack,
just
a
quite
a
low
traffic
period
in
general-
and
you
know-
hopefully
we
do
that
in
a
quiet
hour,
so
that
by
the
time
the
hour's
up
we're
good
yeah.
B: I think, for this one, I mean, it's just so much simpler to implement, to roll out, everything; there are far fewer failure modes this way, and it fails open, which is also useful for users. Yes.
B: Absolutely, yeah. Thanks for bringing that up, Matt; I do need to actually write that up for Craig, because that's what I've been thinking about and haven't gotten around to writing down.
C: ...more, I'm not awake enough to talk about the things that are kind of top of mind right now, so maybe next time. There's been some very interesting discussion around Gitaly's fork lock, or rather the Go standard library's fork lock, and how that interacts with Gitaly's child process caches, like the catfile cache. But again, I'm not...
C: I haven't had my tea sink in yet, so I'm not quite alert enough to have a proper discussion about it, but maybe just kind of a toe dip, if that's okay.
C: So, as I'm sure you both know, we've got this kind of fairly complex system of maintaining a pool of child processes, these git cat-file processes, that Gitaly will preserve. Rather than allowing them to be purely ephemeral processes that satisfy one gRPC call, it will keep them around in anticipation of potentially having another request for the same repository...
C: ...that's compatible with that process. This saves us roughly 50%: about half the time, we don't have to fork a new process. And there's been some very, I think, healthy discussion about whether or not this historical decision from a few years ago is still the right choice in terms of the complexity-versus-benefit trade-offs. One of the things...
C: So that's one discussion. And there's a separate discussion that I feel is about to converge with it, where, as part of incident response follow-up, Igor and I and a few others have been, kind of intermittently over the last week, digging into some concrete examples of long-tail response times; the FindCommit endpoint was one example.
C: There were a few other endpoints that initially we were investigating separately, and one of the common attributes that we've uncovered is a burst of CPU contention. It's a little hard to talk about this without a visual aid, but let me just describe it in broad strokes. There are two layers of scheduling that matter in this context; what I'm working up to is the Go fork lock.
C: In addition to that, Gitaly has an additional layer, what's it called, the spawn token, which imposes an additional rate limit, and I'm not entirely sure how beneficial that is, because there is already this underlying fork lock. But what I want to talk about is this global mutex. It's a reader-writer lock, so it's effectively a semaphore that guards whether or not the given Go process, Gitaly in this case, is allowed to create a child process.
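[Editor's note: the lock being described matches syscall.ForkLock in Go's standard library, a sync.RWMutex where syscalls that create file descriptors take the read side (so close-on-exec can be set atomically) and fork/exec takes the write side, serializing concurrent spawns. A minimal sketch of that mechanism plus the kind of wait-time measurement discussed next; the wrapper is illustrative, not Gitaly's actual code:]

```go
package forklock

import (
	"os/exec"
	"sync"
	"time"
)

// spawnLock models syscall.ForkLock: readers are fd-creating syscalls,
// and every fork takes the write lock, so bursts of spawns queue here.
var spawnLock sync.RWMutex

// SpawnWithWaitTime runs a command and reports how long we waited just
// to acquire the spawn lock, the quantity the traces showed reaching
// many tens of milliseconds under contention.
func SpawnWithWaitTime(name string, args ...string) (waited time.Duration, err error) {
	start := time.Now()
	spawnLock.Lock() // queue behind in-flight forks and fd creation
	waited = time.Since(start)
	defer spawnLock.Unlock()

	// The real fork+exec happens inside the standard library, which
	// takes syscall.ForkLock itself; this outer lock is only a model.
	return waited, exec.Command(name, args...).Run()
}
```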
C: It turns out that in at least some of the cases (we're not sure what the percentage is, but in at least some traced examples) we find many tens of milliseconds of waiting for that lock to be acquired. That implies either that there's a burst of demand for that lock, where there are many... sorry: each Gitaly process has 32 OS threads that many goroutines can be scheduled onto. I started to talk about this earlier, but there are two schedulers that matter. One is the kernel scheduler, which decides which tasks get scheduled on CPU, and the other is the Go runtime scheduler, which schedules goroutines to run on those long-lived OS threads, the Ps in Go's scheduler model.
C: So effectively Go can say: okay, goroutine A, which perhaps wants to acquire this fork lock, you're allowed to start running on this designated OS thread. The kernel will then schedule that OS thread, which has entered the runnable state from the kernel's perspective, to run on a CPU, maybe, sometime. Here's the catch: by default, the Go runtime is going to allocate 32 OS threads on a machine that's got 32 cores. But we know that on our Gitaly boxes the Git child processes consume a significant percentage of CPU time, sometimes more than Gitaly itself. So when we have CPU contention at the host level, at the kernel level, those OS threads that the Go runtime is managing for Gitaly don't actually get access to the CPU as often as they think they do; they're contending for CPU, at the kernel scheduler level, with all of the child processes that they're trying to manage. We're not sure how big of a deal this is, but it has to have some influence; whether it's a major influence or a minor influence is an open question. And by influence I mean this: the question we're trying to answer right now is what is influencing the duration of that lock contention for this fork lock, and, again in broad strokes, there are kind of two angles that can influence it.
C: One of them is: was there a burst of requests for fork locks, where many goroutines want to fork a child process at roughly the same time? Or do we have a stable rate of incoming requests for that lock, but the lock duration has increased? There are a few ways we can differentiate between those two cases. In the latter case, where the lock duration is longer than usual...
C: ...this is one of the ways that the CPU contention we were talking about a couple of minutes ago may be coming into play. So this is very much an open research topic, and we're kind of putting time into it as, you know, time allows; it's not a primary focus for anyone.
C: I have a couple of ideas for next steps, and I know Igor has a couple of ideas for next steps as well; I think we haven't talked about it in a couple of days, but I feel like this is a worthwhile thing to look into. I'm not leading up to anything, by the way; I should have said that to start with. This is just kind of rambling about an interesting problem.
C: I think this has relevance in terms of long-term planning, for two reasons. One, the outcome of this analysis of, essentially, lock contention over the ability to create child processes has, I think, a meaningful impact on the discussion of whether or not having a cached pool of child processes, like the git cat-file cache, is or is not still a good design choice.
C
Cache
is
or
is
not
a
good
design
choice
still,
because
we
we
know
that
we're
we're
cutting
in
half
the
the
number
of
times
we
want
to
do
those
forks,
and
I
guess
the
there
are
a
few
ways
to
kind
of
address
that
if
we
don't
if,
for
any
reason,
we
we
significantly
increase
the
demand
for
forking
child
process
like
like,
for
example,
removing
the
git
file
cache.
C
There
are
a
couple
ways
we
can
address
that
the
the
kind
of
conceptually
simplest
one
being
spread
the
work
across
more
piddly
processes,
which,
in
my
view,
means
spread
the
repos
across
more
deadly
nodes.
There
are
a
few
trade-offs
in
in.
Let
me
leave
that
aside.
I
guess.
C
The
other
thing
I
wanted
to
mention
is
so
the
the
question
about
the
the
benefit
of
having
of
having
a
pool
of
cached
child
processes
that
are
potentially
usable
is,
is,
I
think,
influenced
by
this
by
this,
by
what
we
uncover
here
with
the
four
block
contention
and
the
other
piece
is,
is.
C
Being
aware
of
the
cause
of
the
contention,
if
it
turns
out,
for
example,
to
be
contention
at
the
cpu
scheduling
level,
then
we
can
do
something
about
that,
with
no
code
change
and
and
and
without
having
to
move
reapers
around
by,
for
example,
explicitly
telling
the
go
runtime
to
continue
to
to
provision
32
os
threads,
but
but
resizing
the
giddily
hosts
to
have
say
50
or
100
more
cpus
available,
so
that
there's
less
contention
at
the
at
the
colonel's
scheduler
level.
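[Editor's note: the two knobs in that idea are independent. GOMAXPROCS fixes how many OS threads may execute Go code at once, while the host's core count is what the kernel scheduler divides between Gitaly and its Git child processes. A small sketch using the real Go runtime APIs, with illustrative values:]

```go
package main

import (
	"fmt"
	"runtime"
)

func main() {
	// By default GOMAXPROCS equals the number of visible cores, so a
	// 32-core Gitaly host runs up to 32 threads of Go code while also
	// competing with Git child processes for those same 32 cores.
	fmt.Println("cores visible:", runtime.NumCPU())
	fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0)) // an argument of 0 just queries

	// The proposal sketched in the discussion: pin Gitaly to 32 even on
	// a bigger box, leaving the extra cores as headroom for the child
	// processes at the kernel scheduler level.
	runtime.GOMAXPROCS(32)
}
```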
C: Does that make sense? Without reducing capacity in terms of scheduling goroutines, because that also could just change the bottleneck. Is the number of child processes limited somehow?
C: Oh, that's a great question. We're not running in... yes, there is a hard limit on the number of child processes, but we're not anywhere close to it.
B: It's more the creation rate. That's okay, yeah. I've seen a couple of issues around this, so forgive me if I'm covering some things, but I did see one where Patrick was mentioning that we always spawn two processes as part of the cache, even in the cases where we only use one, and then a refactor he did, to try to lay the groundwork to remove that, ended up with a goroutine leak. So clearly the code itself is quite hard to reason about and modify safely. Yes; I guess that's the challenging aspect of it.
C: Yes, I think I also read about that, although probably not in great detail. Patrick, I think, was one of the people saying that this catfile cache is quite complex, and that it adds a significant maintenance burden to the code. And I love that we're talking about, you know, considering the trade-offs of jettisoning a complex piece of code if it's not providing enough value to justify it, and I kind of suspect that we may know more now than when we added that code.
C: In terms of the contention: personally, I still want to understand the interactions between our Gitaly-specific spawn token mechanism and the fork lock, which I suspect is kind of an implicit additional concurrency-limiting factor.
C: At the time we introduced this... yeah, it's, you know... and the Go runtime changes over time as well, I mean.
C: So it may be obsolete at this point, and we can't...
C: So, in terms of next steps, I feel like a discrete, helpful question to answer is differentiating between those two cases: do we have a spike in demand for this lock, or does the lock duration increase, when we have one of these events where there's significant contention and it affects the gRPC response time?
C: I kind of had that in the back of my mind earlier in this talk, when we were looking at the response time distribution on a per-endpoint basis: we can have these kinds of subtle interaction effects. That's obviously not the only way it can happen, but because it was kind of top of mind after looking at it in the last couple of days...
C
It
was
hard
not
to
think
about
that
in
in,
in
the
context
of
trying
to
evaluate
long
tail
response
times,.
C
That's
that's
again
not
leading
up
to
anything
I'm
just
kind
of
using
about
the
topics
we've
talked
about.