From YouTube: Scalability team demo 2021-05-12
A
So yeah, there's only a couple of us, so like I said, I'll just do a kind of speed run of the two items I had, and then if that's it, it's a reasonably short video for other people to watch.
A
So, first of all: we're working on being able to migrate from one worker per queue to multiple workers per queue, so that we can drastically reduce the number of queues we listen to on Sidekiq in production. That should give us a big reduction in CPU saturation as well, because the current CPU saturation is basically some combination of the number of queues we listen to, the number of clients we have, and the distribution of work within those queues. Of those, the easiest one for us to affect downwards is the number of queues we listen to. Quang-Minh has done most of the work on this, in terms of allowing queues to be, sorry, workers to be routed to different queues. By default a worker will go to its own named queue as before, but you can set some configuration that says this job should actually go to this queue. The migration then becomes fairly simple to think about: you set those configuration changes, you wait for the old queue names to drain, and then you stop listening to those queues.
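For illustration, a minimal sketch of the routing idea as described here, assuming each rule is a [query, queue] pair evaluated top to bottom; the rule syntax, tag name, and `route` helper are hypothetical, not GitLab's exact implementation:

```ruby
# A minimal sketch of ordered routing rules: the first matching rule
# wins, and a nil queue means "keep the worker's own queue name".
RULES = [
  ["tag=needs_own_queue", nil],  # matched workers keep their named queue
  ["*", "default"]               # everything else shares one queue
].freeze

def route(worker_queue, tags)
  RULES.each do |query, target|
    matched = query == "*" || tags.include?(query.delete_prefix("tag="))
    return target || worker_queue if matched
  end
  worker_queue
end

route("post_receive", ["needs_own_queue"]) # => "post_receive"
route("chaos:sleep", [])                   # => "default"
```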
But there is a wrinkle, which is, oh, I need to remember what this set is actually called, oh gosh. So there is a wrinkle, which is this: in Sidekiq there are two sets, or there are actually four, but there are two that we care about, of jobs that are sort of global. We have a sorted set for scheduled jobs in the future, and we have a sorted set for jobs to be retried. There are also interrupted and dead jobs, but we don't care about those so much, because we don't really do anything with those on gitlab.com.
A
In those sets, the entries, gosh, this is a bad explanation. Those entries contain the queue name as part of the JSON payload. So if I take a look in here, you can see this one is going to the background migration queue, and this one is going to the background migration queue. So what that means is, if we just stopped listening to those queues and then something got popped off the scheduled set, it would go to a queue that we don't listen to, and hopefully we'd alert on that, but, you know, we'd lose it.
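For concreteness, Sidekiq stores the scheduled set as a Redis sorted set named `schedule` (the retry set, `retry`, has the same shape), where each member is a job's JSON payload and the score is when the job should be enqueued. A minimal sketch of peeking at it:

```ruby
require "json"
require "redis"

redis = Redis.new

# Each member embeds the destination queue name in its JSON payload.
redis.zrange("schedule", 0, 2, with_scores: true).each do |payload, score|
  job = JSON.parse(payload)
  puts "#{Time.at(score)}  queue=#{job['queue']}  class=#{job['class']}"
end
```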
A
First, so yeah, I'll just talk through quickly what it does. Basically, it uses SCAN, or ZSCAN because it's a sorted set, to step through the set, and it just uses the queue configuration that you've already set up, the worker configuration for the routing. So it just says: make the scheduled set's queue names and worker names match the queue names and worker names from the routing configuration.
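The actual rake task is in the merge request under review; as a rough sketch of the approach being described, with a hypothetical `route_for` standing in for the routing lookup:

```ruby
require "json"
require "redis"

# Sketch: walk the sorted set with ZSCAN and rewrite any entry whose
# queue name disagrees with the routing configuration.
def migrate_sorted_set(redis, set_name, &route_for)
  redis.zscan_each(set_name) do |payload, score|
    job = JSON.parse(payload)
    target = route_for.call(job["class"], job["queue"])
    next if target == job["queue"]

    rewritten = JSON.generate(job.merge("queue" => target))
    # Only re-add if we actually removed the old entry (more on this below).
    redis.zadd(set_name, score, rewritten) if redis.zrem(set_name, payload)
  end
end

migrate_sorted_set(Redis.new, "schedule") { |_klass, _old_queue| "default" }
```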
A
I was thinking about allowing arguments, but I think it's simpler to just say: you run this, and if your configuration is what you want it to be, then it does what you want it to do. There are 20,000 jobs in the scheduled set I have locally, and that took a couple of seconds. There are like 60,000 in production, but obviously nothing else is using this Redis at the moment.
A
I'm not even running Sidekiq right now, I'm just running Redis and a console, so it might take longer in production, but that's the basic idea. There's also an equivalent task for the retry set. So the idea there is, you add a couple of steps to that migration: set up the new configuration, run these tasks to migrate the scheduled set and the retry set, wait for the queues to drain, stop listening to the old queues. So yeah, that's the basic idea there, and that's in review now, so hopefully that can be merged soon. Yeah. Any questions on that?
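As an aside, the "wait for the queues to drain" step can be checked with Sidekiq's own API; the queue names below are placeholders:

```ruby
require "sidekiq/api"

# Before dropping the old queue names from the Sidekiq config, confirm
# the old queues are actually empty.
%w[background_migration project_export].each do |name|
  queue = Sidekiq::Queue.new(name)
  puts format("%-25s %d jobs remaining", name, queue.size)
end
```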
B
For that rake task, I'm curious how long the longest single Redis operation was. It wasn't the entire multi-second duration, was it?
A
Oh no, right, so yeah, sorry, I meant to explain that when I started saying it scans, and then I stopped. So yeah, good question. It uses zed-scan, or z-scan, let's just say ZSCAN, which is O(1) per step.
A
I think it then uses two O(log n) operations, which is the remove and the add. It's slightly annoying to me that you can't remove by key and by score, so I just have to remove by key, because obviously there could be several jobs scheduled at the same time and I might only want to remove one of them, and we can't edit it.
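In Redis terms, a toy sketch against a scratch Redis (key and payloads invented): ZADD and ZREM each cost O(log N), and ZREM matches on the member alone, which works here because the JSON payload is unique per job even when several jobs share a score:

```ruby
require "redis"

redis = Redis.new

# Two jobs scheduled for the same time: same score, distinct members.
redis.zadd("schedule", 1_620_000_000, '{"jid":"a","queue":"default"}')
redis.zadd("schedule", 1_620_000_000, '{"jid":"b","queue":"default"}')

# ZREM takes only the member; there is no "remove this member only at
# this score" command. That's fine: the payload is unique per job.
redis.zrem("schedule", '{"jid":"a","queue":"default"}') # => true
redis.zcard("schedule")                                 # => 1
```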
A
It does also check, because this set represents work that can be popped. There is the return result of what happens when we try to remove it, and it says: okay, if we tried to remove that and nothing got removed, don't add it back, so we don't double-schedule the same job. And yeah, I guess technically you might need to run that rake task twice; I forget what the SCAN guarantees are on, like...
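The guard being described, continuing the loop body from the earlier migration sketch (`old_payload`, `score`, and `new_payload` as defined there):

```ruby
# ZREM returns false if nothing was removed, e.g. because Sidekiq
# already popped the job mid-migration; in that case the rewritten
# payload must not be added back, or the job would be scheduled twice.
redis.zadd("schedule", score, new_payload) if redis.zrem("schedule", old_payload)
```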
A
Yeah, that makes sense. So yeah, that's the basic idea. I've written up some administrator documentation for this, but like I said, it's still in review. I also realized today that I should probably add this to the runbooks, because this is also in issues, but I think it's going to be useful if we put it in the runbooks, because then I could just be more specific about what we want to do on gitlab.com, rather than the general "you, as a generic GitLab administrator, this is what you do here". So yeah.
A
Documenting this like this is optional at the moment. The eventual end goal would be that we probably do something automatic, if this works well for us, but we could leave that for a while, because we still have the queue selector, and the idea is that we don't need the queue selector if we do this, because you will be listening to a handful of queues, and there will be these virtual queues, I guess, that are configured, you know, that are routed within the application itself. But it's not a high priority to move everyone over to that way of working, I don't think.
C
Just a quick, I don't know what to call it, then: reading the document that you placed in that merge request, it was nearly impossible for me to figure out whether I should run this or not, right?
A
Okay, so I've got the next thing as well, which is about how we actually roll this out. I spoke to Craig about this; we talked a bit asynchronously and then we had a chat yesterday evening my time, yesterday morning his time, and we wanted to clarify a couple of things. So, first of all, the scope of the project that we're working on at the moment: we said "catch-all" initially, which is kind of ambiguous, because we have two catch-alls.
A
We have one on VMs and one on Kubernetes, and they listen to different queues. We're making it explicit that this is for the Kubernetes one, the one that listens to more queues already.
A
I made a recent change with Skarbek that means that, by default, new workers go to Kubernetes rather than VMs, because obviously we don't want to add new stuff to VMs that potentially might not migrate to Kubernetes; that's kind of a nightmare. So it's much better if it goes there first, and of course, you know, migrations in general are going from VMs to Kubernetes, not the other way around. So we're making it explicit that it's about that.
A
Also, Andrew, you might remember a while ago we stopped listening to some queues on gitlab.com to get a small CPU drop on the Redis Sidekiq instance. There are like 400-odd queues, and we figured there were like 30 to 40 of them that we don't actually use in production, like some Geo queues, some of the chaos ones, and...
D
If I recall, we've got some funny alert that, if anything ever appears in those queues, will generate an alert.
A
Yes, yeah, that's the same thing, yeah.
A
Yeah, so this is now simpler. So, like, it's still quite a long selector, wait...
D
I'm not sure, are you intending to share your screen?
A
Sorry, yeah! I forgot to actually click the share button. So catch-all used to be, like, it was literally like 9,000 characters long. Now it's the concatenated list of the other shards, including, I just tagged all the workers that we don't currently run on Kubernetes, so we can put them there.
A
That's still not ideal, because it means in the application there's a thing that says "exclude from Kubernetes", and there's no actual reason to exclude most of those from Kubernetes, except that that was what we were already doing. So this needs to be a temporary situation, not a long-lived situation, but for now it makes that a lot simpler to reason about. And then we've also got an "exclude from gitlab.com" tag, but I think I might be in the wrong, yeah, I think I might be in the staging file here.
A
I can never remember which is which, but anyway, we added that "exclude from gitlab.com" tag to some workers as well, and again, the idea is that's temporary. I actually applied that to staging and production, and then it turned out, I should have realized, that we actually do use Geo in staging. So we were like: wait, why is nothing working?
A
Yeah, there we go. So the lower line is after the change, and the upper line is before the change; this is week on week. Sorry, so that's, you know, it's a drop. It's not really going to make any headlines, but we did save that CPU back.
A
So it's quite a nice way of having the first part of the rollout be something that would only affect staging anyway. I'm not saying we make the change on staging at the same time as production, but even if we did, it wouldn't actually be impacting production. So that worked out quite nicely. So yeah, the plan is we'll basically do those, we'll then define, sorry...
A
I should just keep sharing my screen instead of turning it off and on. So, the way the routing rules work is that they're global, and obviously, Andrew, you'll know this, because we talked about this like a year ago, but they're global. At the moment each Sidekiq shard only knows which queues it processes.
A
So this is a priority list, which means we'll have something like this initially, where we sort of define all these shards, and null just means "match the rule, but don't change the queue name". So we can have them all there, and then "exclude from gitlab.com" can go to default, because that's the way we can test on staging. And then, as we want to roll this out, we can just add selectors up here. We don't need to combine into one mega-selector; we can just add. So the plan there... oh, sorry, just...
D
Where is this config specified? Just so it can help me sort of, oh...
A
In gitlab.yml, in the Rails app; it gets there via Omnibus or via the charts.
D
Okay, thanks, yeah.
A
And yeah, so this would be the sort of initial state. We're routing the ones that don't do anything on production anyway, and everything else ends up in null, which means that, like, you know, the rest of this is essentially a no-op. It's just to make it clearer to people what's going on.
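To make the shape concrete (shown as a Ruby literal for illustration; in gitlab.yml this lives as YAML, and the selectors and ordering below are illustrative, not the production file), the priority list reads top-down, first match wins:

```ruby
# Illustrative initial state: route the queues that do nothing on
# production, let everything else fall through unchanged. nil means
# "rule matched, keep the worker's own queue name".
routing_rules = [
  ["tag=exclude_from_gitlab_com", "default"], # testable on staging first
  ["urgency=high", nil],                      # hypothetical shard selectors
  ["resource_boundary=cpu", nil],
  ["*", nil]                                  # catch-all: no renaming yet
]

# Illustrative end state once everything is migrated: delete the
# temporary lines and change the final nil to "default".
end_state = [["*", "default"]]
```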
A
After that, we can go to, you know, actually migrating workers that do stuff, and we're not set in stone on this, but one way we thought might be reasonably neat might be to do it by feature category, because we have some feature categories that represent some fairly low-volume workers. Then we don't get back into the situation where we have a bunch of things selected by name that we then need to untangle.
A
So that was sort of a nice thing; it's a shorter selector. And yeah, and then once we get to the final stage, we would be simplifying again, because we'd...
A
...select here, so that if it matches any of the other ones, we still don't.
A
I think that's kind of a nice model anyway, just in terms of, like I said, understanding what it does: you just read down until you find one that matches, and then stop. And it does mean that we can do things like this, where we can just add that arbitrarily in there, and we don't need to change everything else to fit. And then, at the end, we would delete this line, delete any feature category lines we had, and just change this null to default, and it should all work. Job done, so yeah.
A
The end configuration should be fairly simple. At that point we then need to decide if we want to go through and allow arbitrary queue names to push to these, too.
D
So, we might, and I think I know the answer to this, but you've probably thought this through a lot more: there's no sort of risk from the speed at which that config change is going to get rolled out? Across VMs and Kubernetes pods it's probably going to be quite vastly different, yeah.
A
Right, so that is actually a very good question. When I sort of skimmed over at the end there, I said this is fairly straightforward to do for the other ones. The thing with the other ones is, we would need to add a new queue and start listening to that queue before anything showed up in it, right? Because we'd need to say: listen to all the queues related to, say, project export, but also listen to this dedicated queue that will contain all the jobs for project export.
A
Once I've done this next thing, so we need to make sure we do that in two steps. For this one specifically, we picked the default queue, which is there and is never used for anything, but is already listened to by the catch-all nodes. So we're going to get that for free, and that's why we picked the default one, because, okay, yeah...
C
If it doesn't go that way for some unexpected reason, it might be valuable figuring out how to pair with Delivery to actually get the rest of the VM Sidekiq queues over to Kubernetes, because we don't have a blocker anymore for it.
A
Yeah. Craig and I were actually talking about that yesterday, because I think the Pages thing is, well, Jarv will know, because he's also on that issue: it's a temporary directory that happens to be on the NFS mount, but we don't think it needs to be on NFS. It's just that that's where it happened to be historically, so, in... or, no...
F
Well, no, we know, I mean, I think, but maybe you said this, Sean: it made sense for it to be there, because you're moving temporary files and you want to keep it on the same volume, right?
F
Yeah, yeah, so I think that was a smart decision at the time. It's just that we're still using this as temporary scratch space, so we're thinking about just reconfiguring the Pages root directory to be somewhere else other than the NFS mount, which will write the temp files to the root partition, where we have, on average, 10 gigabytes free. We're thinking that's okay. I mean, another option would be to expand the root volume temporarily. Yeah, I mean, it's either root or /var/log; those are the two options. We could...
F
We could create the temp directory, and the Pages root directory, in /var/log. It just feels weird. Maybe that's a better option, though, if this is only temporary anyway.
B
...both of the things, directly, yeah. I mean, they're both, neither one's an SSD or anything, and one is much, much larger than the other, and there's a meaningful consequence to filling root, and there's not to filling /var/log. So yeah, go with the safer route, even though the name...
A
Yeah, one thing on that move point, Jarv, that I realized after I posted that: it makes sense that, like, you know, if you want to move a file from temp to thingy, you know, you don't want to move...
A
Cool, yeah. So Marin, Craig and I did sort of discuss that briefly yesterday when we spoke, because we were like: the other thing is, once we've done catch-all on Kubernetes, the next best target is to stop listening to all the queues on catch-all VMs. One way to do that would be to give catch-all VMs its own queue and, like, migrate to this, but another one would just be to migrate...
A
I
say
just
it's
a
big
just
just
be
to
migrate
those
jobs
from
vms
to
kubernetes
and
get
this
as
a
result
so
yeah.
We
think
that
would
be
the
natural
next
biggest
impact,
assuming
this
makes
a
decent
impact
so
yeah,
I
think,
I
think,
carrying
on
with
the
kubernetes
migration.
There
makes
a
lot
of
sense
instead
of
avoiding
double
work.
Yeah.
C
Yeah, I might see if we can prioritize that sooner, because as soon as the API is in production, we could take a short break from the next service and see how the API behaves, and while we do that, just migrate all of the Sidekiq and be done with it; make it simpler for everyone.
A
So yeah, that was the Sean show, I guess, for the demo. Does anybody have anything else they wanted to talk about? Bob had something on the agenda, but he's not here.
A
All right, I'll upload this shortly. Thanks everyone, have a great day. Thanks, Sean. Sorry, Andrew, do you want to go?
D
No, no, no, it was something that just crossed my mind, but it was reminded by Bob's thing, and, does anyone know about... it's more a question than anything else.
D
Do we have any observability around SAML? Because there is a thing that's happening today: we had an incident, and I looked at something, and I saw that I actually blocked an account that wasn't causing the incident, but that's a thing for another day. What it actually is, is that every time this person uses the API, it uses 400 megabytes of memory, and Heinrich looked at the request.
D
He
said:
there's
nothing
in
this
request
that
uses
for
it
makes
a
gidley
call,
and
then
we
realized
that
the
customer
is
using
saml
and
it's
also
using
like
six
seconds
of
cpu
every
request,
which
is
really
crazy,
and
so
I
started
looking
around.
I
couldn't
find
anything
in
our
logs
and
I
couldn't
find
anything
in
our
metrics
for
saml
and
that's
kind
of
scary.
A
Yeah, I don't think, I don't know, I don't think we have anything like that. I do know that some of the responses for SAML, I think, can be big, but I don't know why that would happen when we make a Gitaly request. So yeah, no, but...
B
Was the SAML operation also correlated with the CPU burn you observed? Because that's an enormous amount of CPU.
D
Yeah, it's, I mean, it's up to seven seconds. So, we don't really know yet, and I haven't had a chance to look at it in much detail today yet, but the call itself is really basic, and actually the call fails, because there's no repository. So it's not, there's, like, literally, it should be a 404, but it's a 500, because...
D
What
you're
saying
yeah
yeah
yeah
the
giggly
response
is
like
not
found
and
and
that's
all
it's
doing,
and
so
the
there's
something
else
for
these.
The
only
other
thing
is,
I
think
it's
an
api
token,
but
I
don't
know
how
api
tokens
interact
with
saml,
because
presumably
you
get
the
api
token
from
gitlab,
not
from
saml,
but
it
probably
has
to
do
some
some
dance
there,
but
yeah
I
mean
if,
if,
if
their
saml
provider
is
giving
back
like
a
50-meg
chunk
of
jason
every
time,
we
call
them,
then
that's
not
really
that
good.
D
But
I'll
I'll
open
an
issue
about
it.
D
I mean, I think, and, you know, this particular customer looks like a really, they've got six accounts, and so they're probably using, like, I don't know, a SAML provider. But it would be interesting, especially now that we've already reached out to them to tell them that we've blocked their accounts, it would be, like, interesting to figure out what's going on there. But yeah, I'll open it up.
D
Obviously there's a lot of good, we get a huge amount of good, like all the mechanical sympathy logs, and we can pinpoint exactly, you know, which request is bad, we can do all sorts of good things. But, you know, do we just keep adding all of these things to the access log?
A
Yeah, the application settings table, yes.