From YouTube: Scalability team demo - 2021-04-01
Description
See also Craig's demo: https://www.youtube.com/watch?v=NuamleKHRDA
A
So the first item on the agenda for this week is Craig, who has been doing some experiments with Sidekiq Redis. The Redis that we use for Sidekiq has been very close to CPU saturation for a while. Last week Igor, Matt and Bob got us back to, what was it, like 90-odd percent, Bob? From like 99 percent.
B
A
From, from saturated to just not saturated. Yeah, so we still need to make more improvements there. So what Craig's been working on this week is setting up just a couple of VMs with a Redis server and then a client that sort of acts like our application, but doesn't actually do any work in the Sidekiq jobs.
A
It still schedules jobs, and the jobs sleep for an amount of time that roughly matches the distribution of how long our jobs take overall. Then you can treat those like shards, like the shards we have, but with just one box doing all the client work, to try and reproduce what we're seeing in production.
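A minimal sketch of the kind of harness described above, assuming a local Redis reachable through redis-py, with invented queue names and an invented duration distribution standing in for the measured one; this illustrates the idea rather than Craig's actual setup:

    import json
    import random
    import time

    import redis  # redis-py

    r = redis.Redis(host="localhost", port=6379)

    # Invented stand-ins for the Sidekiq shards; the real experiment mirrors production.
    QUEUES = ["catchall", "memory_bound", "urgent_cpu_bound"]

    def sample_duration():
        # Placeholder distribution; the real harness samples from measured job durations.
        return random.expovariate(1 / 0.3)  # mean ~300 ms

    def schedule(n):
        # The "application": enqueue jobs that only carry a sleep duration.
        for _ in range(n):
            job = {"class": "SleepWorker", "args": [sample_duration()]}
            r.lpush(f"queue:{random.choice(QUEUES)}", json.dumps(job))

    def work(queue):
        # The "worker": pop a job and sleep for the requested time instead of doing real work.
        while True:
            item = r.brpop(f"queue:{queue}", timeout=1)
            if item:
                _, payload = item
                time.sleep(json.loads(payload)["args"][0])

    if __name__ == "__main__":
        schedule(1000)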
A
So it's not identical, but it's a good way to test out things that are harder to test in production, or would take more work to even be able to test. So he's recorded a video about that. I haven't watched it yet. I'm not sure it's a great idea to watch it in the demo, because the demo will be going on YouTube and the video is also on YouTube, so it seems much more efficient to watch that separately. Has anybody watched it?
A
So far, no? Okay, so in that case I guess I can jump to his conclusions once I find them.
A
So, basically, we had a couple of speculative ideas, but there were two basic options we figured we had, both of which require a reasonable amount of effort. One is what we were calling Sidekiq zonal clusters; essentially that's sharding, although shard is kind of an overloaded term in the Sidekiq context. It means having multiple independent Redis Sidekiq instances, which should work because Sidekiq doesn't guarantee ordering. We do have some places in the application where we look at...
A
We store things on the Sidekiq Redis which need to be global, but we could equally store those on the persistent Redis, which would remain global under this proposal. So, Sean.
B
One thing that I was thinking of in this context, or maybe we should talk about it later, but if we are going to move those things to a Redis instance, should we move them to a different Redis instance, just to not add on to the pile that's...
A
A
There was an incident from three weeks ago or so; when we say Redis in our services, that means the persistent Redis, and you know, we store a lot in the persistent Redis. We can shard that as well, I guess. So there are a couple of options there.
A
One is to use Redis Cluster, where it handles that for us, but we haven't got any experience of doing that. Another one is to do one manual split now. And obviously, if we're adding Sidekiq stuff here, although we already have some Sidekiq stuff here, because I think this is the...
A
I can't even remember what this one is, Bob, maybe you remember, but yeah. Basically, the idea was we could either split out sessions or the database load balancing stuff from the persistent Redis, and then the next step after that would possibly be to move one of those to Redis Cluster, at which point we can rest a little easier. But yes, Bob, you're right: if we're solving the problem on the Sidekiq Redis by pushing more stuff into the persistent Redis...
C
But as long as we can keep whole functional units on one Redis, it doesn't really matter which Redis they're on.
A
C
B
A
Yes, I mean, if we have to do it twice, we will do, but ideally we would just pick the right place in the first place. As far as I'm aware, most of the stuff where we deliberately add something to the Sidekiq Redis doesn't need to be on there; it's just on there because it's logical, because it's related to Sidekiq. But we don't do a lot of... I was worried we did.
A
We would do more of this, but we don't do a huge amount of poking at Sidekiq's internal state directly, except for a couple of cases which I'll mention in a second. So yeah, sorry, one of the options was what we call zonal clusters. The idea was that we could have a Redis in a Kubernetes zonal cluster and Sidekiq instances in that zonal cluster, and I think API nodes are going into zonal clusters soon. Jarv, is that right? Yeah, they'll be going into zonal clusters soon.
A
Yeah, so we'd have Sidekiq nodes in there as well, all of the Sidekiq shards, so catch-all, memory-bound, etc., and they'd have their own Redis. And while we were talking about that, we realized that Redis doesn't need to be in Kubernetes for that, which probably simplifies the operational work there. So we could do that, because we might never want to put Redis in Kubernetes. Certainly it would probably be further down the line for data stores, I'm guessing. Right, Jarv?
D
A
Sorry, that's what I was trying to get at: there's no need to couple a Sidekiq zonal cluster in with the API nodes, or to Redis being in that zonal cluster. Redis can still be on a VM; it's just a different Redis to the one that we're using now. So we'd have two Redis Sidekiqs on VMs, and some Sidekiq nodes in one zone, or all Sidekiq nodes in one zonal cluster, use that one, and every other Sidekiq node, either on VMs or in Kubernetes, uses the other one.
A
C
It could also mean two different things, because one thing it can mean is to reduce the number of jobs per second, and another thing it can mean is to reduce the number of clients connected to Redis, and that depends a bit on which it is. I think the original zonal cluster proposal would have all queues exist in each cluster, completely separate. So then the number of clients goes down and the number of jobs per second goes down.
C
But if you do a functional split, then all clients... say the post-receive queue is on a separate Redis. Then all clients would still have to connect to the post-receive queue. So then, yes, you've reduced jobs per second on that one, but you haven't reduced the number of clients. Strictly speaking, those are two independent variables.
A
Yes, yeah, that's right. And the assumption we were going for with this was that it would be a full split, so we'd have a full set, everything would be processed in both, and there's obviously an open question there about what we do with the queues that are only processed on VMs because they need NFS, etc., etc.
A
But that's the basic idea, because that's essentially shared-nothing as far as Sidekiq Redis goes, which is nice. So the other option was something we talked about just over a year ago, when we were wrapping up our work on Sidekiq, because we have about 400 queues that we listen to in our application, and Sidekiq recommends no more than a handful of queues.
A
So consider that we have, I think, seven shards, maybe one more if you count the VM catch-all as distinct from the Kubernetes catch-all. What if we had essentially a queue per shard? We could do the same kind of job splitting we do now, where some workers go to one shard and some workers go to another shard, and we can control that without application changes, just through configuration. That's quite a big job, but it is very attractive, and from Craig's results...
A
It will lead to quite a big improvement, because... well, I think Jacob can explain why sticking with a single Redis instance but drastically reducing the number of queues improved things so much. Jacob, do you want to go through that? Sure, I'll stop sharing and you can see.
C
Yeah, it's easier if I share. Let's see, prepare my browser here... there, flame graph.
C
This is a flame graph I captured earlier in the week on the Redis Sidekiq primary. I don't think there was something specific going on there; I just wanted to look in detail at something I noticed in multiple flame graphs, which is that these things, blockingPopGenericCommand and handleClientsBlockedOnKeys, are very big.
C
That's where a lot of the time goes. So what I did then is look up what those things do, for instance blockingPopGenericCommand. If you look... oh, sorry, I tried to get to the tab so that this Zoom bar comes down and I can't click a tab... blocking pop, here we go. So the problem with these is that they start here, right, you see; that's the first instance of the problem.
C
So for 360 queues it looks at whether the keys exist, and if the key exists then it tries to get work off the key and returns early, and if not it goes into this blockForKeys function, which is in here, and again that does a loop over all the keys. So both of these functions... that's roughly the call stack of blockingPopGenericCommand.
C
So this whole chunk is proportional to the number of keys, where keys are queues. So if we go from having 360 queues on catch-all that this loop is going through, down to one, then these things would shrink. And the same argument goes for handleClientsBlockedOnKeys, because that one is also in here.
C
Clients... so that loops over the ready keys, and then, yeah, the interesting one here, sorry, I didn't super prepare this, you can see here that it spends most of its time unblocking clients. So I need to look for unblockClient and blockClient, and you then end up in unblockClientWaitingData, and that one is again iterating over... well, it needs to destroy all of them, see this thing here.
C
The client stores the list of keys that it's blocked on, so that is 360 keys for catch-all, and then it iterates the dictionary, looks all these things up and needs to update them.
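To make the shape of that concrete: every blocking pop from a Sidekiq process names every queue that process listens to, so Redis walks that whole key list on each call, and walks the client's blocked-key list again when unblocking. A rough redis-py illustration, with invented queue names and the counts mentioned above:

    import redis  # redis-py

    r = redis.Redis()

    # A catch-all Sidekiq process today blocks on roughly 360 queues in one BRPOP call.
    many_queues = [f"queue:q{i}" for i in range(360)]
    # Under the single-queue-per-shard idea it would block on just one.
    one_queue = ["queue:default"]

    # Server side, blockingPopGenericCommand checks every listed key before blocking,
    # and handleClientsBlockedOnKeys / unblockClientWaitingData walk the client's
    # blocked-key list again, so the work scales with len(keys) multiplied by the
    # number of blocked clients, even when the queues are empty and no jobs run.
    r.brpop(many_queues, timeout=1)
    r.brpop(one_queue, timeout=1)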
C
A
Okay, so Marin, the basic idea here is that... you don't have to explain it for me, I'll rewatch this recording.
A
The CPU usage on Redis, in the section we're looking at in the flame graph, is proportional to the number of clients, essentially, multiplied by the number of keys that each client is listening to. We have different clients that listen to different numbers of keys, but we have a bunch of catch-all clients that listen to about 350-odd keys.
A
So if we can get that 350 down... well, this is the next part. The single-queue-per-shard work is quite big to get to the end state, but I think we can get most of the benefit by focusing on getting catch-all to a single queue per shard, because, for one, it's the biggest in terms of number of queues; there are 350, it's got a ridiculous config, as Jarv knows, and...
A
We already listen to the default queue, so this is the only shard where we wouldn't actually need to add a queue to listen to. Adding one isn't a huge deal considering we already have 400-plus, but it's a neat thing operationally, because we could start making this config change to say some work starts going to the default queue, and not make any other change to our catch-all nodes: they could continue listening to the same set of queues and they would still pick up that work. And in terms of the total, we'd go from listening to about 420-odd queues across all shards to about 50-odd.
A
So it's basically... you know, it's an order of magnitude just from focusing on that shard. And Craig did an experiment, which I've linked to in the agenda doc, where he tried that option. He tried single queue per shard, and also just regular, what we have now plus catch-all as one queue essentially, and we got about three quarters of the gains.
A
So in his case he saw a 30% drop in CPU usage, compared to 40% if it was one queue per shard everywhere. That seems like the best place to start, in terms of: it's going to be the quickest to implement and it's going to get us most of the gains. That doesn't mean we should stop; once we get to that point, we should then seriously evaluate.
A
A
A
Yeah, and the other nice thing about that is that most of the catch-all queues are boring, and some of them don't even do any work. But not doing any work still contributes to that flame graph Jacob showed, because we still block clients on each of those queues that do nothing. And in the past we have removed some queues from our processing.
A
F
A
B
I looked it up. No, cron job queues are going to be super quiet; there are jobs that run once a week, that kind of stuff.
A
Right, that's a good point. So over a day that I looked at in Prometheus, we had a maximum of about 190 queues processing over any given minute in catch-all, but that doesn't mean they're the same 190; I was just looking at the peak. So, like Bob said, if you have cron jobs that run once a day at different times, you'll have a different cron job in each of those. But yeah.
A
Yeah, exactly. And sorry, Marin, did you want to ask your question?
F
A
F
The usage one doesn't do the work, because we run this manually in production, in the database console, or the Rails console rather. So we have a developer who is dedicated, every week, to going in there, running a script, watching it run for two days, I think, and then seeing if something fails. So that's...
C
A
A
The other case is that there's a worker that we accidentally don't do anything with anymore, but we forgot to delete it. That's technical debt; it shouldn't cause us a performance issue like it is at the moment. So we would like to relegate that back to technical debt status, instead of technical debt plus contributing to Sidekiq-related CPU saturation.
A
A
I haven't written it up, but there are a couple of complications, and I'd just like to talk through those quickly, because I'm sure there will be more, but I'd like to talk through the ones I know about. One that I'm hopeful is mostly tedious work, rather than anything particularly complicated, is that our monitoring is by queue, not by worker. Our dashboards are by queue, not by worker, but we do have the worker label on our metrics.
A
So for most of those we can switch over to worker: we switch the recording rules to worker, and the dashboards and the alerts to worker instead of queue. Because if everything's in the default queue, and whoever is on-call gets an alert about the default queue, that tells them basically nothing, because there will be 350 workers sharing that queue. So that's not going to be helpful to anybody.
A
The one thing we can't get easily with that is queue depth, because at the moment we can say the queue for merges is 10 deep and the queue for post-receive is five deep, we get that from GitLab exporter anyway, but not if both of those are in the same queue. So I guess we could inspect the queue and see what the depth is for each worker, but I don't know, I haven't really thought that one through yet.
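One way per-worker depth could be recovered from a shared queue is to scan the queue and bucket payloads by worker class. A rough sketch of that idea, assuming Sidekiq-style JSON payloads with a "class" field and a local Redis; the queue name and batch size are invented, and this is only an illustration, not an agreed design:

    import json
    from collections import Counter

    import redis  # redis-py

    r = redis.Redis()

    def depth_by_worker(queue="queue:default", batch=1000):
        # Approximate per-worker depth by walking the list in chunks.
        # The queue keeps moving while we scan, so this is only an estimate,
        # and LRANGE over a huge list is itself O(N) work for Redis.
        counts = Counter()
        total = r.llen(queue)
        for start in range(0, total, batch):
            for raw in r.lrange(queue, start, start + batch - 1):
                counts[json.loads(raw).get("class", "unknown")] += 1
        return counts

    if __name__ == "__main__":
        for worker, depth in depth_by_worker().most_common():
            print(f"{worker}: {depth}")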
C
B
C
The observability story... it has implications for observability, and that is, yeah, one of the areas where most of the work would be.
B
C
B
F
A
And so, like I said, some of these observability things... anything that's exported by the application has both queue and worker, so we can mostly switch those over to workers. They can tell us what rate a worker is processing at, what rate we are queueing jobs for a worker, what the error rate is, etcetera. We can map those all over fairly easily.
A
What we do need to look at is any other metrics we get, which are probably from GitLab exporter, which is probably using the Sidekiq API, which is based on queue, not on worker. It will still be useful to know what the depth of the default queue is, and I'm not saying we should replace those, I'm saying we should add, just to be clear, and also because we'll have the separate queues for the other shards.
C
So what Bob just touched on is that there is a conceptual downside to having fewer queues, which is that by having more queues we can, more or less, have fairness guarantees.
A
Yes, it's like passing it on the command line: you tell a Sidekiq process, listen to these queues. So that's the natural unit, and I think it was probably a reasonable decision when we did it, and it's also easier to add queues than to remove queues. So that explains why we've not done this before.
A
You know, it's harder to go in the opposite direction, and that's why I'm using default as kind of a trick: because we already have the default queue, we don't have to remove queues. We just have to shift work to default, and then, when we're confident that's working, we can stop listening to the other queues.
A
So we can do it in a safer way, because obviously any major change to our Sidekiq config is risky, we could lose jobs, and the same applies to the zonal clusters work, of course. Anything else?
D
Yeah, just one comment about that: we should probably prioritize the known queues in catch-all that don't rely on NFS to begin with.
B
D
To do, yeah. And then the other thing, I'll just echo the concerns about increasing the blast radius for failures. Right now our queues are fairly isolated; if there's a problem, say a queue depends on an external dependency and it starts to back up, it's limited to that feature. And even in catch-all I think we have maybe some external dependencies.
D
D
A
Sorry, I completely missed a part at the top of the doc, and I'll just explain that, because I don't think it's going to completely answer your question, but I think it is going to partially answer it. The idea here is that there is a configuration that lets you move which queue a worker puts its jobs in. Each worker has its own queue name, and the idea is that the configuration would let you say to that worker: don't put it in your own queue, put it in the default queue.
A
You know, we have the option to go back to having a dedicated queue for... I don't know, say we put authorized projects in default and then we decided, oh no, this is a terrible idea; we can put it back into its own queue under this model. We're not actually deleting queues at this point, I guess; we are just shifting. Yeah.
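The routing described here amounts to an enqueue-time lookup: an ordered list of rules mapping workers to a target queue, with the worker's own queue as the fallback, so a config change alone can send work to default or back again. A hedged sketch of that shape, where the rule format, worker fields and names are all invented for illustration:

    # Ordered (matcher, target_queue) rules; first match wins.
    # A target of None means "keep the worker's own queue".
    ROUTING_RULES = [
        (lambda w: w["name"] == "AuthorizedProjectsWorker", None),  # pin this worker to its own queue
        (lambda w: w["shard"] == "catchall", "default"),            # other catch-all workers go to default
    ]

    def queue_for(worker):
        # Decide, at enqueue time, which concrete queue this worker's jobs go to.
        for matches, target in ROUTING_RULES:
            if matches(worker):
                return target or worker["queue"]
        return worker["queue"]

    # A catch-all worker now enqueues to "default"; removing that rule
    # (a config change, not a code change) sends it back to its own queue.
    print(queue_for({"name": "SomeWorker", "queue": "some_worker", "shard": "catchall"}))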
C
One of the questions I have here, and Jarv, you maybe know more about that, is how often do we... So the current system is very flexible: a lot of stuff can be in the queues and you only make the choice of where it goes in the server config. So we have a lot of flexibility, but I don't know if we use it. And if you start sorting jobs to different shards before they go into their respective queues...
C
It's much harder to reorganize those queues, because you'd have to, I don't know, run some Rails console script; I think we sometimes do that, right, where we pop through a queue and we toss out jobs or something. But it's about deciding early, deciding eagerly. The current system is lazy, but that concentrates a lot of work on Redis; if we have an eager system, then it's more efficient, but you're also making these choices earlier.
D
It's not something we do often, and I think we basically rely on having lots of these queues and isolating workloads based off of queue. If we change that model, then we would definitely want to improve our management there.
C
So in the example of authorized projects, where it creates a crazy amount of jobs: if we sort jobs early, like Sean said, it should still be a tweakable model. I guess we still keep the selectors, so we could deploy a config change, maybe we would even be able to instantiate a new shard or just a separate queue, where we say, okay, this stuff is a problem right now, it goes in the other queue. But, of course, that only changes jobs that are submitted after the config change.
C
It doesn't change a problem where jobs are already in the queue, like when we have a hundred thousand jobs in the queue and we need to do something about them. So does that matter practically, or is that not something... Sorry, I'm restarting my question. I'm just double-checking how we use this flexibility.
C
C
Exactly. And certain things can be reproduced: in the current model, we could deploy a config change and say, stop processing authorized projects because everybody else is starving for CPU time. But that is a very clunky way of making a change, and I don't know if we even do that, or if we should be doing things like that.
C
But even if we have a model where we're sorting early, we could still do these sorts of things with a server-side middleware, where a job comes in and we have some sort of dynamic config that says: oh, this is the authorized projects worker, I'm going to re-queue this on a separate queue, like an overflow shard or something, and I'm going to redirect this, because this should not bother anybody else. So, yes.
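A sketch of what that kind of server-side diversion could look like: a check consulted before running each job, backed by a dynamic list of worker classes to divert. Everything here, the queue name, the set of workers and the function shape, is invented for illustration rather than an existing middleware:

    import json

    import redis  # redis-py

    r = redis.Redis()

    # Dynamic config: worker classes that should be diverted right now.
    OVERFLOW_WORKERS = {"AuthorizedProjectsWorker"}

    def maybe_divert(raw_job, overflow_queue="queue:overflow"):
        # Called server-side before executing a job: if the worker is on the
        # overflow list, push the payload to the overflow queue and skip it here.
        job = json.loads(raw_job)
        if job.get("class") in OVERFLOW_WORKERS:
            r.lpush(overflow_queue, raw_job)
            return True   # diverted; the caller should not run this job
        return False      # not diverted; run the job normally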
A
There's a small delay in processing for jobs that need to be re-queued, but obviously, if you're making jobs be re-queued, then you probably want a delay.
F
F
F
F
That would be something, a change not only of configuration but a change of architecture, of how we do these things. So, like, is this...
F
F
C
B
Jobs per second is where people are... So Mike mentioned, in the issue that got created when we were discussing the timeout thing, something like 25,000 jobs a second. In an issue, so...
A
So my initial instinct there, Marin, is that if we can make the horizontal sharding story work, then that is probably the next step after we've reduced the number of queues to whatever level we determine. But I did use that magic word "if" at the start of that. And there are some cases, like the admin APIs we have, like the runbooks we have for deleting jobs from a queue, that don't work.
A
Yes, because then everybody has to know about all of them. The nice thing, but also the trap, with the other split is that you can say an application instance only needs to know about one Redis Sidekiq; but then it only knows about one Redis Sidekiq, so it can't see what's going on on the other instances. But hopefully we can think about these questions over the time that we work on this as well.
C
Craig said the same as you, Marin, by the way, at the bottom of his comment.
B
C
This is his conclusion: reducing the number of queues seems like the biggest win, but that's a one-time thing we can do, and we need to think longer term about what we can do about growth.
A
A
A
I also think some of the answers to this question will be affected by the other work that's happening now. I know we're looking at not relying on a single global database if we can, which is its own can of worms, but we probably don't want a completely different model for how we manage our database to how we manage our jobs. Well, we probably do want different models, but what I'm saying is, if the database ends up being sharded by... it's likely...
C
A
Then we would probably do the same for Sidekiq, or at least look to work with that. Which, you know, is a very, very vague answer, but...
F
A
Because, and I think this is another advantage of doing this first, however we shard in future, reducing the number of queues will increase our headroom on whatever shards we end up with, because we won't be listening to a bunch of queues. So we'll have a bunch more space available to us. And also, this one's harder to do if we have started sharding already, because...
A
D
From my perspective, I would prefer to hold off on sharding Redis by cluster, just because my main concern with that activity is our inability to drain an entire cluster, which is something we take advantage of now.
D
If we don't have horizontal scaling for Redis, then when you offset the load, like if you drain a cluster, that's going to put pressure on the other Redises and could cause a problem. So I would much prefer we focus on reducing the number of queues, if we think we can get some CPU savings there.
D
I'd rather do that first, personally, just because, especially with the migrations that we're doing now, we're still kind of in the early phases of feeling out Kubernetes, and we've used draining clusters quite a bit since we've done the migration. This will probably stabilize over time, though.
D
A
F
D
I think the problem is also, even if it's in separate clusters, you would still have a one-to-one relationship between cluster and...
F
F
The purposes are mixed within that cluster, so that you can... yeah, I see what you mean. Okay.
A
Would this be a fair way of framing part of the concern there: if we have two zonal clusters, then draining one of them is a big problem. If we have 10, then it's probably not as big a deal, because the other nine can pick up the rest of that. So if we're at two, zonal clusters don't buy us as much as we'd hope.
D
Yeah, I think if we had ten zonal clusters, we probably would be able to absorb the additional load, but I don't know if that would happen, right?
D
A
D
Right, and we have three right now. We could conceivably increase to five, maybe, but I wouldn't imagine going beyond that, though. Yeah.
A
C
Oh, I wanted to briefly mention that the pack-objects cache is very close to being enabled, but it's basically waiting for some Omnibus code to reach production, and the deploy failed two days in a row. So we're hoping; every day I'm now looking to see if it reaches production, and Matt Smiley is poised to make the change.
C
C
And one thing that was interesting, slightly scary for me: I happened to see an incident yesterday about Apdex SLO violations on file-04, which is not a server I often hear about, but I just peeked over people's shoulders, like, oh, that looks, that sounds, related. And I was just looking at the feature flag, so I could see the cache keys for the pack-objects things it was looking up, and there were crazy numbers of repetitions.
C
So everything looked, again, like this is the classical CI problem. It would have been really fun to turn it on in the middle of that, if I'm right that that was it, but yeah, they did something else to make it stop. But it's just a reminder that it's not a problem that's unique to one area.
F
A
Yeah, so just quickly: Bob mentioned a really good complication yesterday, which is that one nice thing about the queue-per-worker model is that with a mixed deployment like we have now, canary can be scheduling jobs into queues that Sidekiq doesn't know about yet, for workers that Sidekiq doesn't know about yet, because they're in their own queue and Sidekiq won't be listening to that queue.
A
That's fine: if you schedule some jobs from canary that don't exist in production, they just don't get processed until production Sidekiq is updated. But in the new model, if they go to the default queue, they will fail, because we can't find the constant for the worker, because it's not been deployed yet. So I think there are a couple of options there. One would be to have a very specific config that we update every time, to not move those workers to the default queue until they're ready. I don't like that.
A
Another is to have some kind of custom retry. Ideally we would just use Sidekiq retries for this, but a while ago we set our default retries to three, which happen in about two minutes; Sidekiq defaults to 25 retries that happen over the course of a couple of weeks.
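For a sense of why three retries burn out in minutes while Sidekiq's default 25 stretch over weeks: Sidekiq's documented backoff grows roughly as the fourth power of the retry count plus a small constant and some jitter. A rough calculation that ignores the jitter, so treat the exact numbers as approximate:

    def total_retry_window(retries):
        # Sidekiq's backoff is approximately (retry_count ** 4) + 15 seconds, plus jitter.
        return sum((count ** 4) + 15 for count in range(retries))

    print(f"3 retries : ~{total_retry_window(3)} seconds")           # on the order of a minute or two
    print(f"25 retries: ~{total_retry_window(25) / 86400:.0f} days")  # about 20 days, roughly three weeks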
A
B
F
A
C
Then it should get more retries, yeah.
A
A
Basically, I don't think the retries are anywhere near the top of our priority list in general; it's just a bit frustrating that they're limiting us in this particular case.
A
I think those were the only two complications, or the only three complications, I guess, with the one that Jacob mentioned about the workload shifting, that I wanted to talk about on the call. So: the metrics and observability, the workload shifting, and how this works with our deployments. Did anybody else...?
B
B
It kind of ties into the metrics and observability, but sometimes we... like the workers I brought up: we have this self-throttling thing that workers do, where they check their own queue size. Yes, so we will have to re-implement that using something that can be shared, that can be, say, shared state. But you know what I mean.
A
B
Yeah, and I'm not just talking about the limited-capacity things that I was working on with the CI and Package teams, but also global search does something like that, yeah. I linked the workers that I...
B
A
So for global search we can kind of punt on that, because we're not proposing to rename their queues to default just yet. But it's definitely worth thinking through, so thanks for bringing that up, because there were a couple that you mentioned that would go into default, and we would need to handle that somehow.
C
I just want to emphasize: it's really nice that we can try this all out while only focusing on part of the queues, even with this lopsided thing where default is so huge, because we'll discover so many problems that we don't know about yet, and learn how to deal with them, and have some flexibility in dealing with them, or not.
B
Yeah, one question that I had in the doc was, as you mentioned, I do like the idea that we can just move some stuff over to default and then remove some queues already. But I'm wondering, if we're going to be planning for that, is this a configuration thing, or is it going to be a hard-coded thing, like our workers opting into it?
A
Initially, everything will continue to use its own queue, and we will allow you to opt in through configuration, saying workers matching this queue selector, or this worker selector, I guess it would be now, or, yeah, virtual queue selector, whatever we call it, go to this concrete queue instead.
A
C
It's important, I think, to do this whole thing in a way where it's configurable and we don't change everybody else's config while we figure out what works well. Yeah.
A
A
Exactly, because migrating for us is basically just going to mean: listen to those queues until they stop processing, at least initially, then stop listening to them, and maybe do something special for scheduled jobs. That's kind of it. Whereas if we need to migrate it for self-managed users, then we actually need to put in the work there. But if we're specifically focused on gitlab.com, and specifically on the catch-all shard, then we don't need to do that. So, yeah.
C
A
A
All right, if nobody has anything else, I'm going to wrap this up, and then I'm going to watch Craig's video about his demo, his experiments, because that's super interesting as well. And yeah, have a good day.