From YouTube: Redis Sidekiq Scalability Experiment Demo
Description
A quick run-through of the Redis Sidekiq scalability test harness from https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/956
It is crude, but has given us good numbers.
Hey, this is a quick demo of the work I've been doing with the Redis Sidekiq experiments. The goal of this is to get to a base load where we can see roughly production-level performance out of Redis in the smallest environment we possibly can, and then experiment with the various mitigations that we've proposed, planned, and reasoned out: to see which ones are actually going to have an impact, to confirm that they will, and to see what the scale of that impact is.
The full details are on scalability issue 956; I recommend you go and look at that if you have any questions. I'm going to skim over this here very, very quickly. Let me finally share my screen; I want to share this one here.
Okay, right. Noting a few things: I have three instances, a Redis node, a client node, and a Prometheus node. The Redis binaries, as I told you, are simply binaries copied from production.
Compiling was a pain, it wasn't worth it, and it's actually slightly nicer to be really sure these are the same binaries I've got from production or staging. So I've got 5.0.9, that's the older one; 6.0.10 straight up; and 6.0.10 patched, which is the one with our BRPOP fix that Igor and Matt worked on together.
The configurations are basically from production, very lightly modified, just where necessary: setting directories and a simple password, so I don't have to care about that particularly much. Cool.
So that's the Redis node: we select, with a little bit of shell script, one of 5, 6, or 6-patched, and then a run script picks the matching configuration file and just runs it. So that's done and out of the way, no problems.
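As an illustration only, the selection wrapper could look something like this in Ruby; the binary names and config paths are my assumptions, not the harness's actual layout.

```ruby
#!/usr/bin/env ruby
# Hypothetical version of the selection script: pick one of the three
# production Redis builds and exec it with its matching config. Names
# and paths are illustrative, not the harness's real layout.
BINARIES = {
  "5"         => "bin/redis-server-5",
  "6"         => "bin/redis-server-6",
  "6-patched" => "bin/redis-server-6-patched",
}.freeze

version = ARGV.fetch(0) { abort "usage: run <#{BINARIES.keys.join('|')}>" }
binary  = BINARIES.fetch(version) { abort "unknown version: #{version}" }

# Each build runs the lightly modified production config for that version.
exec(binary, "conf/redis-#{version}.conf")
```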
On the client, this is a little bit more complicated: we have created the worker classes.
So these are not exactly a one-to-one mapping of the code base; it's weird in terms of the classes and namespaces and which queues they get put into. What matters is that we've got a class per queue and that they do something. The something is a sleep of random duration: if we look at the application worker, we've basically got job durations pulled out of Kibana at various percentiles, ten-percent intervals plus p95 and p99, and we randomly choose one of those.
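To make that concrete, here is a minimal sketch of what one of these per-queue workers might look like. The class name, queue name, and duration buckets are illustrative stand-ins, not the real Kibana-derived figures.

```ruby
require "sidekiq"

# Hypothetical duration buckets standing in for the percentiles pulled
# out of Kibana (ten-percent intervals plus p95/p99); these numbers are
# illustrative, not the real production figures.
DURATION_BUCKETS = [0.02, 0.05, 0.1, 0.2, 0.4, 0.8, 1.5, 3.0, 6.0, 15.0].freeze

class PostReceiveWorker
  include Sidekiq::Worker
  sidekiq_options queue: "post_receive" # one class per queue

  def perform
    # "Do something": sleep for a randomly chosen production-shaped duration.
    sleep DURATION_BUCKETS.sample
  end
end
```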
So what this means is that the jobs will be flowing through at a good rate, and it mimics production to some degree. I mean, the bottom ten percent will take only 0.02 of a second, and we've got to be able to throw a lot of jobs through. But this gives us a reasonable layout, a reasonable shape, for those jobs, for how long they actually take. We ignore the last one percent, above p99, because it gets up to something like 1,500 seconds.
Some of these jobs take 10 to 15 minutes to run; that's not interesting at the scale of what we're trying to do. We're looking at the high-throughput stuff; a single Sidekiq worker sitting there doing nothing for 30 minutes is not relevant. So those are the workers, ready to run. We run them with a little Sidekiq wrapper: it's literally Sidekiq itself, configured. We just require all the files and point it at Redis. We must use the semi-reliable fetch.
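Enabling that looks roughly like the documented setup of the gitlab-sidekiq-reliable-fetch gem; treat this as a sketch, with the Redis URL and password as placeholders.

```ruby
require "sidekiq"
require "sidekiq-reliable-fetch" # require path assumed per the gem's docs

Sidekiq.configure_server do |config|
  # Placeholder connection details for the client node's configuration.
  config.redis = { url: "redis://redis-node:6379", password: "..." }

  # Semi-reliable fetch is the BRPOP-based mode used in this experiment.
  config.options[:semi_reliable_fetch] = true
  Sidekiq::ReliableFetch.setup_reliable_fetch!(config)
end
```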
This is interesting: we'll be able to do this again when we get past this to the single queue per shard. We'll be able to test with the fully reliable fetch in this environment and see what it does, because it uses a completely different Redis command.
Yes: RPOPLPUSH instead of BRPOP. So we set that up, and that's the simplest thing.
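In terms of raw Redis commands, the difference between the two fetch modes looks something like this with the Ruby redis client; the queue and list names are illustrative.

```ruby
require "redis"

redis = Redis.new(url: "redis://redis-node:6379") # placeholder URL

# Semi-reliable fetch: block on the queues until a job arrives or the
# timeout expires; the job leaves Redis the moment it is popped.
queue, job = redis.brpop("queue:default", "queue:post_receive", timeout: 5)

# Reliable fetch: atomically move the job onto a per-process working list,
# so a crashed worker's in-flight jobs can be recovered and requeued.
job = redis.rpoplpush("queue:default", "working:worker-1")
```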
Right, and you can run it with this, so we literally just run it. Oh, I've also copied in basically most of the GitLab environment from production.
It's worth noting, just in case you're not fully familiar, that catchall has two fleets of workers, VMs and Kubernetes. The VMs are running the workers that we still think might, or do, need NFS; the Kubernetes ones are the ones we have validated to explicitly not require anything any more. They are two discrete sets of workers for the catchall shard, the one with no resource boundaries. So that runs Sidekiq.
So with this run-multiple-sidekiqs script I can run a whole bunch of them in the background. With all of those running, it looks a little like this: I will run them on the catchall VM shard, 56 of them. I grabbed those numbers literally from our current production counts of running workers across VMs and Kubernetes, pods and processes and all that, so we try to get the same number of worker threads from Sidekiq talking to Redis.
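A minimal sketch of what run-multiple-sidekiqs could look like in Ruby: the 56 matches the catchall VM count mentioned above, while the concurrency, the Kubernetes fleet size, and the paths are assumptions.

```ruby
# Start a fleet of Sidekiq processes in the background, mirroring
# production worker counts. Only the 56 comes from the video; the rest
# of these numbers and paths are placeholders.
FLEETS = {
  "catchall-vm"  => { processes: 56, concurrency: 15 },
  "catchall-k8s" => { processes: 56, concurrency: 15 },
}.freeze

FLEETS.each do |name, fleet|
  fleet[:processes].times do |i|
    pid = spawn("bundle", "exec", "sidekiq",
                "-c", fleet[:concurrency].to_s,
                "-C", "config/sidekiq-#{name}.yml",
                out: "log/#{name}-#{i}.log", err: [:child, :out])
    Process.detach(pid) # leave it running in the background
  end
end
```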
I think this is okay in terms of what we are trying to test here. Every time I think about it, it bounces back and forth in my head, but I think for Redis it doesn't particularly matter for performance whether we are putting work into one queue or into 20 queues.
I think that's just because it's single-threaded and it's only one data structure; there's no locking to worry about. So I think that's okay. I'm going to have to get a little bit further than that to do some of the other experiments we want to do, so we will see what happens then, but I'll be putting that off, because it's not quite the tests we're really interested in.
The generator itself is very crude: basically, I just stuff a whole pile of jobs in, and these numbers here are the production job counts.
So I do that. I've stuck with that and haven't gone back and re-edited it, because it can generate production-like load, and then we just schedule the work in here. In practice we find we actually need to run this generator twice, and that's still barely keeping up; so it's really cramming work in there as fast as it can, two or three times over, to get up to base load.
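As a sketch of the generator's shape, assuming hypothetical worker names and counts rather than the actual production job counts:

```ruby
require "sidekiq"

# Cram jobs into every queue as fast as we can, in production-like
# proportions. Worker names and counts are placeholders.
JOB_COUNTS = {
  "PostReceiveWorker"     => 50_000,
  "PipelineProcessWorker" => 30_000,
  "WebHookWorker"         => 20_000,
}.freeze

JOB_COUNTS.each do |worker_class, count|
  # push_bulk enqueues one batch per round trip, far faster than
  # calling perform_async once per job.
  Array.new(count) { [] }.each_slice(1_000) do |args_batch|
    Sidekiq::Client.push_bulk("class" => worker_class, "args" => args_batch)
  end
end

# Run this script two or three times concurrently to reach base load.
```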
So that's the base load. We also have, for example, the one-queue-per-shard variant, from the one-queue-per-shard experiment. The only difference is this here: instead of simply calling perform on the class, it sets the queue explicitly. It overrides all the other logic that would select the queue and just uses a static name there. And then we have slightly different config files for Sidekiq, where it just listens on that one queue.
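At enqueue time, the difference between the two variants is roughly this, using Sidekiq's standard set API and reusing the hypothetical PostReceiveWorker from the earlier sketch; the queue name is illustrative.

```ruby
# Base-load experiment: each worker class enqueues to its own queue,
# as configured via sidekiq_options on the class.
PostReceiveWorker.perform_async

# One-queue-per-shard experiment: override the queue at enqueue time so
# every job lands on one static queue, regardless of class. The queue
# name "catchall" is illustrative.
PostReceiveWorker.set(queue: "catchall").perform_async
```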
So, to see it in action: Redis is running up here. We want to start the workers, so I will start the workers. We will do the full original load, so we run that there, and I'm tailing all the logs over here. This takes a minute or so to start up; I'll just run top here.
At the moment Ruby is still starting up, doing its thing; you're probably terrified by that, but it's fine. Now we wait for these logs to... there we go, they're all started up. Load goes up, and we will see in here... here we go: this is our base load, gradually rising. Just drop it down. So, our base load. This is also with the BRPOP timeout set to five seconds, which is what we're running in production; two seconds is worse than this. So, very good: we get up to about 30.
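As a back-of-the-envelope check on why the timeout matters, here is the idle-load arithmetic; the thread count is an assumption derived from the fleet sketch above, not a measured figure.

```ruby
# Idle BRPOP arithmetic, purely illustrative.
threads = 56 * 15   # assumed processes * concurrency for one fleet
timeout = 5.0       # seconds; the production BRPOP timeout

# Each idle Sidekiq thread re-issues BRPOP once per timeout interval,
# so the idle command rate scales inversely with the timeout.
puts threads / timeout  # => 168.0 BRPOPs/second at 5s
puts threads / 2.0      # => 420.0 BRPOPs/second at 2s, hence 2s is worse
```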
This might stabilize in another few seconds... yeah, it just rises at the start, when everything's connecting; that must be catching a few jobs, probably from my previous tests. Cool. There you go: that's the startup time, and it stabilizes at about 12, which is consistent with our previous experiments.
Yeah, that's running once, just showing you how long it takes to schedule all those queues: about three seconds, though it varies a bit, three to four seconds. Then we will run two; just trust me that this is a good thing. So I can see all the workers doing their thing; I'll just cancel that for a second. So yeah: 1.374 seconds, quite a few of those like that, and so on and so forth. 761, yeah.
So those are busy, busy doing things, and we can see what's happened over here: two minutes, and there we go. With two generators we can get our Redis up to about... actually, you can see where the first one started, and then where I started the second: about 90, which is actually pretty close to what we're seeing in production right now at our peaks. There's lots of variation here, and this is not entirely accurate, but it's the right order of magnitude, and in fact probably closer than that.
So I'm very happy with this, because now we can run all the other experiments, which you can see all the results of, and see what happens to that number, like dropping it down from 90 to 50, which is awesome. I hope that makes sense. You can get access to this; it's not well set up, but if you hit me up I can drop your SSH key on there, and we can run experiments and do this again in the future, which is awesome.