From YouTube: GitLab Production Incident 5158 Corrective Actions
Description
Stan, Matt, Andrew, Jason, Marin and others discuss some corrective actions following on from a production incident: https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5158
A: There are a lot of fixes we need to make, and some are more urgent than others. From my point of view, the first thing we should do is act on the suggestion of moving the authorized projects workers to their own queue, or rather their own Sidekiq shard. That will give us immediate, temporary relief while we talk about the other options. The long-term, really big project is getting rid of CarrierWave; Lester has been saying this for two years, and we have to do it, because the current situation is bad. In between those, I would say we should also fix the reactive caching issue, because it has basically masked all of the alerts: we've been ignoring the Sidekiq thread-contention alerts for a month because of this known reactive caching issue. I've seen people comment on the issue for that saying it's going to be really hard, but Stan suggested quite a simple solution.
A: Another thing I would do, and this wasn't actually a GCS issue, but it might have helped us if it had been, and it would have pointed us in the right general direction, is something we talk about every time we have a GCS issue: creating a service for GCS in our service catalog and metrics catalog. It would basically have two SLIs, one for Sidekiq object storage and the other for registry object storage, and perhaps Pages as well, though I don't know whether we have metrics from Pages for its GCS operations. Then, if any of those things go south, we can go and look there. If it really were GCS, we'd see all three of them dropping; in this case it would only have been the Sidekiq one, so we'd know something else was the problem. That should be quite easy to do.
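As a rough illustration of that idea, here is a minimal sketch, expressed as a plain Ruby hash purely for readability (the real metrics catalog in the runbooks repository is defined in Jsonnet), of a GCS service definition with the two SLIs described above; the field and SLI names are assumptions for the example.

```ruby
# Illustrative pseudo-structure only; names of SLIs and labels are assumed.
GCS_SERVICE = {
  type: "gcs",
  tier: "stor",
  service_level_indicators: {
    sidekiq_object_storage: {
      description: "GCS operations performed by Sidekiq jobs",
      significant_labels: %w[bucket operation],
    },
    registry_object_storage: {
      description: "GCS operations performed by the container registry",
      significant_labels: %w[bucket operation],
    },
    # pages_object_storage could be added if Pages exports GCS operation metrics.
  },
}
```

With something like this in place, a drop confined to the Sidekiq SLI while the registry SLI stays healthy points away from GCS itself, which is exactly the triage signal described above.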
B: Right, and on that front: I know our GCS support person noticed the transfer times taking longer, and I didn't know how to get at that data; I was looking at the console. Is that metric available somewhere? I'd love to at least have it reported, because it would help us distinguish when it's GCS.
B: Well, the other issue, I think, is to investigate authorized projects and figure out whether we can make it much more efficient. I think you talked about the database bloat issue, but I feel like there may be some quick wins in either not updating everything or, at least, not redoing work when you've already updated a user's authorizations.
B: You don't need to schedule another job for that user, because a lot of these jobs are redundant. If we have 7,000 members in a project, you don't need to schedule the refresh 100 times for each of them; you can just do it once. It's bad enough doing it once, but 7,000 jobs is much better than 700,000 jobs.
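As a sketch of the kind of deduplication being suggested, here is roughly what it could look like in the worker itself, assuming GitLab's ApplicationWorker deduplication attributes are applicable to this job; the strategy and options shown are illustrative, not a record of the actual change.

```ruby
# Hypothetical sketch: skip re-enqueueing an authorized-projects refresh for a
# user while an identical job is already pending. Assumes GitLab's
# ApplicationWorker deduplication support; options shown are illustrative.
class AuthorizedProjectsWorker
  include ApplicationWorker

  feature_category :authentication_and_authorization
  urgency :high

  # Jobs with identical arguments scheduled while one is still waiting to run
  # are dropped instead of being enqueued again.
  idempotent!
  deduplicate :until_executing, including_scheduled: true

  def perform(user_id)
    user = User.find_by_id(user_id)
    user&.refresh_authorized_projects
  end
end
```

With something like this, a membership change fanned out across 7,000 users still enqueues at most one pending refresh per user, which is the reduction from 700,000 jobs to 7,000 jobs described above.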
C: No, they got pulled into that project, right? Yeah.
B: Are we capturing that list of to-dos or issues? We could attach them to the rapid action epic. I'm just writing notes, and then we'll try to find all the issues and link them to whatever epic we create. Okay, cool, where are the notes? Sorry, oh, I'm just taking the minutes; I don't know if anybody else is taking those. I just wanted to get this out there right now, and then we can put it in the chat.
B: So, I guess what I heard Andrew say was that, number one, we want to break up the Sidekiq pools, that is, move some workers to a different pool. That sounds like it's pretty much an infrastructure task. Are there folks in development we should be involving in that piece? No? Okay. And then there's removing CarrierWave, which I just looked up; that's the file upload library, yeah.
B: CarrierWave is a much longer-term project, yes, and we already...
B: You can change that behavior. The trade-off is that you might save the entry in the database before the file persists in object storage, so there's some danger of something appearing to be there when it isn't really there yet. But I think the infrastructure risk is the bigger problem, so we could change that mode; for example, we could just do it, live with the consequences, and see what happens.
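Purely as a hypothetical illustration of the mode change being discussed, assuming a CarrierWave-mounted uploader whose default store callback runs inside the saving transaction; the model, uploader, and callback names are made up for the example and are not GitLab's actual implementation.

```ruby
# Hypothetical: persist the database row first, then store the file in object
# storage after commit, so the upload no longer holds the DB transaction open.
class Attachment < ApplicationRecord
  mount_uploader :file, AttachmentUploader

  # Assumption: CarrierWave registers an after_save "store_file!" callback that
  # would otherwise run inside the transaction. Skip it and store after commit.
  # The trade-off discussed above: the row can briefly reference a file that
  # has not finished uploading yet.
  skip_callback :save, :after, :store_file!
  after_commit :store_file!, on: [:create, :update]
end
```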
D: And that protects us against any cause of the delay, whether it's GCS or thread contention. It's a more general mitigation: effectively, it tries to minimize the duration of the leased DB connection from PgBouncer. That's great, that makes sense. Yeah.
A: I think another thing this illustrates is how problematic that Ruby thread-contention saturation metric can be, so we should maybe make it a higher priority. At the moment I think it only goes to Slack; it doesn't page. The problem is that it will alert a lot if we turn it on now, so we should fix that first.
D: Think about it: the thread contention is a major contributing factor to this incident, but the resource that we saturated, in terms of the side effects spreading to all Sidekiq jobs rather than just the ones with the...
D: Yeah, I'm kind of wondering: what do you think about making saturation of the PgBouncer connection pools a higher-level alert? Because I think that's...
A: It would get lost in the flood of alerts, then. We've got four different alerts that are all very similar: we've got the replicas and the primaries, and then Sidekiq and the synchronous web path. I think the problem is that the asynchronous one, the Sidekiq one, is saturated so much of the time, mostly because of ArchiveTraceWorker, that we might have to keep it as a lower-level alert because...
A: ...it fires a lot because of another known issue that we want to fix. Okay, so yeah; but if it's not that again, then it is a critical situation.
A: The Sidekiq shard, you mean? Yes. Basically, the way it works at the moment is that we have different shards...
A: But in terms of the Ruby thread saturation, and the other...
A: ...we basically exclude that one from the alerts, and then, while that's getting fixed, we act if we see saturation alerts for anything else. One of the problems is that we've just been ignoring those alerts because of one known bad queue, and if we put that queue in the sin bin, then we'll actually get the alerts saying, hey, the urgent-other queue is totally saturated.
C: Okay, so that sounds great. Sorry...
C: I really hate to interrupt this discussion, but it would be good to know whether we can take action right now on separating the shard out, isolating it. We have a call in half an hour to monitor the situation, and it would be best if we could split off and start doing that work now, so that we can stand down that call if possible.
C: To be clear...
C: Correct. Skarbek, how about you and I split off to a separate location? We can do a call, an issue, or whatever, and start working on that. Matt, would you be willing to do a sanity check? I think Skarbek will need some help, because the selectors are hard to get right.
E: So yesterday we saw this problem where one of the shards went down to its minimum pod scale because it was struggling to do this work: since we scale on CPU, and the CPU usage of that shard was so low, we just scaled down to the minimum. If the same holds for this new shard, we'll kind of magically avoid creating havoc across all of Sidekiq, and instead it'll be limited to just this one shard, the sin-bin shard, so to speak.
F: Okay, we can address that. As John has pointed out, though, when it comes to the CPU and giving that shard its own pods, that doesn't directly solve the problem of PgBouncer being overloaded by a given shard, or shards plural. So we may need a small, dedicated PgBouncer for the problematic ones. The good news is that while we would need to put that into place, configuring it for the application consumers is actually very easy.
A: Yeah, so my take on that is that we don't have a good grasp at the moment of our PgBouncer pool sizes. Effectively, we're taking 500 connections into our Postgres primary and dividing that up between the synchronous and the asynchronous pools, and the maths doesn't add up at the moment.
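As a rough illustration of the kind of audit being described, here is a minimal sketch, with made-up numbers, of checking whether the PgBouncer pool budgets actually fit within the primary's connection limit; the pool names, sizes, and process counts are assumptions, not production values.

```ruby
# Illustrative only: do the per-pool budgets, multiplied by the number of
# PgBouncer processes serving each pool, fit under the primary's limit?
MAX_PRIMARY_CONNECTIONS = 500 # assumed budget on the Postgres primary

pools = {
  web_api: { pool_size: 60, pgbouncer_processes: 3 },
  sidekiq: { pool_size: 90, pgbouncer_processes: 3 },
}

total = pools.sum { |_, p| p[:pool_size] * p[:pgbouncer_processes] }

puts "Possible server connections: #{total} / #{MAX_PRIMARY_CONNECTIONS}"
puts(total > MAX_PRIMARY_CONNECTIONS ? "over-committed" : "headroom: #{MAX_PRIMARY_CONNECTIONS - total}")
```

With these particular made-up numbers the total comes to 450 and there is headroom; the point of the audit is that nobody currently knows whether the real numbers land over or under.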
A: Frankly, I don't know if you've ever looked at this, but it's pretty hairy, and that's with two pools; if you add a third, that's a good project, but I don't think it's a project we do right now. It needs proper consideration and better monitoring and planning around it, because I suspect that if you actually laid it all out, we're either quite far over-committed or, worse, probably quite far under-committed and actually throttling ourselves.
A: If you look at the number of PgBouncers and whether we're over-committing or under-committing PgBouncer connections to Postgres, I think it's just been done on an ad-hoc basis: we've made a change to PgBouncer here, changed the pool sizes, then made a change to Postgres over there, and I don't think any of those numbers add up at the moment. It's definitely something we need to fix, but I think it's a case of "here be dragons".
D: I agree with you, but I guess my point is that the fact that web and API were not affected by this regression is entirely because we have separate connection pools for them. If that had not been the case, this Sidekiq regression would have taken the whole website offline for hours.
D: Exactly, exactly; that's why we introduced it. So I think, in the same way, even if we don't get the numbers exactly right, there's a large margin for error if we steal some of the connections that are currently budgeted to the Sidekiq pool and give them to a separate one.
A: CarrierWave was the thing that saturated PgBouncer, but the thing that caused the problem was the authorized projects worker, which wasn't actually consuming that many connections. So that's also something you'd have to take into consideration if you did that; you'd have to do some sort of audit. Or, you know what you could do: we have the Sidekiq attribute tags, so you could tag all of the CarrierWave Sidekiq jobs with a carrierwave tag, and then we could use a selector for that.
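A minimal sketch of that tagging idea, assuming the `tags` worker attribute is available and that the fleet's queue selector can match on it; the tag name, worker, and selector expression are illustrative assumptions.

```ruby
# Hypothetical: mark workers that store or fetch files through CarrierWave so
# the Sidekiq fleet can route them with a selector such as "tags=carrierwave"
# (the selector string is shown only as an illustrative comment).
class ProjectExportWorker
  include ApplicationWorker

  feature_category :importers
  tags :carrierwave # assumed tag name consumed by the routing selector

  def perform(project_id)
    project = Project.find(project_id)
    # ... builds the export archive and uploads it via a CarrierWave uploader.
  end
end
```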
D: Right, and I think you, or Stan, said earlier that it is possible to configure CarrierWave not to hold the DB transaction open while it does that fetch, so some of those job classes may already be doing that. Am I being too...
B
Hopeful
here,
yes,
the
only
one
that
said
export
only
the
project
export,
because
people
were
saving
like
gigabytes
of
files
and
we
were
timing
out
because
there's
a
there's,
a
trade
out
there
right,
because
if
you
commit
the
issue.
D: ...there's a race condition between the two. If we commit early, could we... I mean, there are a few ways to address that, but none of them are trivial.
A: But if we can get the things that cause the front-end contention on the threads into their own sort of shard, where they won't do that to the CarrierWave workers, you know, if we keep those away from the CarrierWave...
D: Okay, okay, all right. I'm sold on this as mitigation for the trigger condition of the last two days of incidents. I do still think it's important to address the connection pooling, because that's a more general class of problem, and it is the resource that caused this to spread across a broader scope.
D: Yes, I think so: splitting what is currently a shared pool used by all Sidekiq job classes into separate pools. I like your idea of having a separate one for CarrierWave, but if we fix CarrierWave, Matt...
A: Yeah, yeah, but I do think, and this is something that's always a low-grade worry for me, that doing a proper audit of all of those, the sum total of connections that we hold in PgBouncer, sounds like the kind of project we need to do as well. And then I...
D: I agree. I think that's a very important question to ask, and I'd be happy to take that on today, if that's useful. I think, which...
D: Sure, yeah. I'm assuming that's more work than splitting the pool, and if that's a bad assumption, then we shouldn't bother splitting the pool; we should just focus on fixing CarrierWave. I was going to add, though, that we don't necessarily have to quarantine all job classes that use CarrierWave. We could potentially just focus on the ones that tend to be very high volume, because the...
B
Ci
phase
has
to
happen,
so
you
could
argue
that
let's
focus
on
the
ci
ones
for
now
as
kind
of
a
mitigation,
but
you
know
obviously
most
of
our
customers
are
not
going
to
be
thinking
about
this
complexity
of
you
know
another
sidekick
pool,
so
we
probably
want
to
try
to
avoid
like
painting
solutions
for
ourselves
that
other
people
can't
benefit
from
one.
A: One of the things that might actually help there, and I'm just going to use hypothetical round numbers now, is this: say we had 10 connections for the scary CarrierWave workers. Then we could say the HPA only scales up to something like 15 concurrent Sidekiq jobs for those workers, and everything else will start queuing rather than timing out waiting for a database connection.
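A back-of-the-envelope version of that headroom check, using the hypothetical numbers from above; the per-pod concurrency and replica ceiling are extra assumptions added purely for illustration.

```ruby
# Illustrative arithmetic for the quarantined shard, not real settings.
dedicated_pool_size = 10 # assumed PgBouncer connections reserved for these workers
max_replicas        = 3  # assumed HPA ceiling for the shard's pods
threads_per_pod     = 5  # assumed Sidekiq concurrency per pod

max_concurrent_jobs = max_replicas * threads_per_pod # => 15

if max_concurrent_jobs > dedicated_pool_size
  waiting = max_concurrent_jobs - dedicated_pool_size
  # Up to `waiting` jobs queue inside this shard for a connection instead of
  # saturating the shared PgBouncer pool and timing out elsewhere.
  puts "Jobs that queue for a connection at peak: #{waiting}"
end
```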
F: Two things. First, I think we should be tagging the jobs that use the problematic code with the attribute, as Andrew suggested. I do agree with Matt, though, that taking all of CarrierWave and moving it over is problematic. If we have a couple of specific problematic background workers, the ones using CarrierWave, we can split those off specifically and limit them in the most generic and straightforward way, as Andrew has pointed out: just give them a dedicated set and keep their HPA low to make sure they can't max things out on their own.
F: I have no idea what any of that is anyway, so I'm not sure whether we actually need to do the PgBouncer work. Now that I've heard further discussion and caught up on some more details, Andrew rightly points out that if we just limit the problematic set of workers to a maximum number, they can't saturate PgBouncer, because they won't be able to open enough connections to saturate it. That actually somewhat reduces the complexity of putting something in place to mitigate the problem we have, because it's a code issue.
D: ...the size of the DB pool. That's certainly safe with respect to avoiding catastrophic saturation conditions, but under normal conditions the connection pool is not the constraining resource, so I think that would reduce our throughput for those jobs as well, right?
A
We'd
have
to
be
very
careful
and
figure
out
what
the
normal
you're
on
youtube
matt,
but
yeah
we'd
have
to
make.
A: ...we have enough headroom, and we don't know that, because, especially with the volume of traffic that goes to ArchiveTraceWorker, if we constrain that, the queue will just grow very quickly. Yes, so you definitely want to have headroom in that, yeah.
D: Agreed. I was just thinking out loud here, trying to play devil's advocate, if that's the right term, and thinking about other edge cases. Okay.
F: But yeah, we need room to stretch, but we don't need to put the ceiling so far above our heads that we're jumping up and down on each other's shoulders before we hit it. We need to be watching the metrics for what's causing the actual load and what the known pain points are, so that we know when we're hitting the tipping point where we've got these backed-up logs due to GCS being slow, thanks to CarrierWave.
A: In this case, though, we had that alert and we'd been ignoring it for a month because of a different piece of infradev technical debt. So we do have the alert; that's why I was talking about putting that reactive caching worker into a sin bin. Or, I mean, we could just change the alert so that it ignores anything from reactive caching... oh no, we can't, because it's per shard, not per worker. But we could...
A: I have to go because I have to say good night to my kids, so I'm gonna jump off. Thanks, everyone.
B: So I guess, who can we get on this? This issue was created two months ago and it's in the backlog, and I got confused with the SAST one, with the JUnit one, so this is actually a different problem. But Sam, I think you're on this call. Is he? Are you still on the call, or did you jump off? Okay, I guess I'll bring him up live.
C: I think this is something in Wayne's territory, so it might be worth bringing Wayne in first.
C: I also need to drop off; it's 8 p.m. here. Skarbek, ping me if you need any support; I'll have my phone with me.
D: The work on the Redis single-queue project is... oh, he's gone; I'll ping him offline. John, why...
D: So this probe, actually, Stan: this was the probe I was showing right before we started recording, the one that scans through the list of all of the job entries in each queue. We turned that probe off, I think, a day or two ago, and we're ready to enable the new probe, so I wanted to...
B: I guess Andrew raised this earlier: have we reviewed that it's safe to do that now? What are we doing differently, then? Yeah.
D: I think we benchmarked this at, at most, two milliseconds of CPU time per minute, which seemed reasonable to me. But yeah, I just wanted to see whether that was still something we were going to do today, especially since we have the rapid action work to focus on as well. So I'll ping the EOC about that and see if they're comfortable with it. John...
E: It would be good to ensure that I'm operating against the right queues. So far, while this meeting has been taking place, I've been trying to figure out which shards all these queues operate on, and Andrew provided me with a feature category that I could use. Okay...
E: I'm thinking about creating a shard that explicitly uses that feature category in its query. Unfortunately, some of these queues run on catch-all, some of them appear to run on another shard, and so on; these queues are spread across multiple shards. So it looks like urgent...
E: So I'm working on a merge request; you create one for staging and one for production. I just need assistance in making sure that I'm doing the selectors correctly. Right, okay, so...
G: Tag Craig as well, because he'll be online in the next hour or so, so he'll see that coming in when he gets on.
D: I was gonna say, I thought we were mainly looking to move just the authorized projects worker to an isolated shard.
E: In the merge request, that makes the query slightly smaller. Well...
B: And I think that might be the one that's mostly used, you know, whenever you share a project or create a new project. I haven't checked.
B: Well, we can look at our Sidekiq logs and tell you. Yeah, yeah. Hey, could I jump in with a quick question unrelated to what y'all are talking about: the root cause, what's actually causing the underlying issue? Is there a good summary of that? I'd like to be able to communicate that out to others. Sure.
B: I mean, Andrew summarized it at the beginning of this call, but do we need to summarize it in the issue now? Yeah, I might not have been on the call at that point. Sure.
B: I kind of have a sense of it, but if someone wants to either point me at it or just summarize it real quick, I'd be willing to type it out and capture it someplace. So...
D: So, Matt, do you want to take a shot at this? Sure, yeah. So we've already got a comment in yesterday's incident issue to the effect that the connection...
D: Sorry, which connection pool? This is the database connection pool that all Sidekiq jobs share. But the new piece of information that folks uncovered while I was asleep is this bit about the thread contention on, oh gosh, what is it... So, having a flood of these...
D: ...these authorized projects jobs was producing thread contention among the Ruby threads on our...
D: ...in our threads, yeah, exactly. And this was implicitly delaying other jobs that were running on other threads from handling their storage IO, and as a side effect of that kind of recurring delay, they would hold their database transactions for a much longer period of time. Longer leases mean the pool becomes saturated, and that's what led to the propagation effect.
B: Okay, gotcha. So basically: a flood of authorized projects jobs, possibly triggered by a single customer, creates thread contention in Ruby, which then starves the IO, since CPU cycles are needed for that; the DB transactions slow down, the pool becomes saturated, and it's a bad time all around.
B: Yep, okay, awesome. I'll capture that in the issue and share it in the channel, because I know folks are interested. The other thing they'd be interested in is what exactly we're doing, and I think, Stan, you've got a good list there that we can update as we go. So I'll stop talking.
D: Thanks for bearing with me. No, that's fine. And so the working theory is that by moving these authorized projects workers to a separate Sidekiq shard, they will no longer compete with all of the other job classes that run on the existing Sidekiq shards, and therefore they're much less likely to induce the contention that was indirectly producing all of the side effects.
D: And, just to be clear, they're not doing anything wrong. It's just that the way they're trying to use the product is not a way that we've made the product scale properly. So...
G: My meeting is about to start, the one where we all go and join the incident room. I'm just watching the Postgres overview for idling transactions, and that's not happening yet. So do we all just want to jump over to that Zoom? We can stay muted, screens off, and do whatever we need to do, but keep charging forward, Skarbek, with what you need to do. I just thought I'd check whether we're okay dropping this call and joining the other one.
G: The Zoom chat, sure. It's also in the topic of the main Slack channel, not the agenda, but...