From YouTube: 2020-04-03 Sidekiq Migration demo
C
So this is the Grafana dashboard that I'd created just as a sort of temporary thing as we were rolling this out. What it shows you are the Kubernetes pods on top and then the VMs at the bottom. The bottom is blank for each of these panels because we don't have any VMs processing project export jobs. But when we were running this in parallel, it was kind of interesting to compare how many jobs are being processed by the pods versus how many jobs are being processed by the VMs, as well as errors and stuff.
C
I would say this is typical of the errors that we've been seeing. I kind of want to dig into these a bit more, I just haven't had time, but these were kind of the typical errors you would see both on the VMs and the pods, and this is kind of typical right now. Sometimes we get a big spike of project exports, and then we have a big backlog, and then the Apdex drops because there's queue latency. This has been happening maybe once per day or so.
C
We get a huge spike of people running project exports, and every time I've looked into it, it's someone who's just running exports across every project in their namespace. We do have rate limiting for this on a single project, but not on the namespace. So it could be something that we might consider to prevent this sort of thing.
C
Well, sorry, I was just going on about people who have these interesting usage patterns. Typically once per day or so we see someone backing up every project in a given namespace, and we don't have any specific rate limiting at the namespace level; it's just at the project level.
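As a rough sketch of what a namespace-level limit could look like, to complement the existing per-project one, assuming a simple counter store; the key format, limit, and window below are hypothetical and not GitLab's actual implementation:

```python
import time

# Hypothetical namespace-level rate limit for project exports, modelled on
# the per-project limit mentioned above. Limit and window are made up.
NAMESPACE_EXPORT_LIMIT = 30   # max exports per namespace per window
WINDOW_SECONDS = 3600         # 1-hour window

class ExportRateLimiter:
    """Counts export requests per namespace in fixed time windows."""

    def __init__(self, store):
        # `store` is any dict-like counter backend (a Redis wrapper in
        # practice); a plain dict is enough for this sketch.
        self.store = store

    def _key(self, namespace_id):
        window = int(time.time() // WINDOW_SECONDS)
        return f"project_export:namespace:{namespace_id}:{window}"

    def allowed(self, namespace_id):
        key = self._key(namespace_id)
        count = self.store.get(key, 0) + 1
        self.store[key] = count
        return count <= NAMESPACE_EXPORT_LIMIT

limiter = ExportRateLimiter(store={})
print(limiter.allowed(namespace_id=42))  # True until the limit is hit
```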
E
Like, 90% of jobs get exported at 06:00 UTC every day, because that's when daily CI jobs run, and lots of daily CI jobs kick off a project export, and those queue up for like an hour. So if you look at the average every day, we have lots of jobs that take a very long time to get scheduled. So I wouldn't be too stressed about how long it takes to schedule those jobs.
E
I mean, we've moved project export to be what's called throttled, so we're not judging it on queueing time, because we are protecting the fleet from too many exports; we'd rather have each export take longer and protect the fleet. So I would say, because it's a throttled job, you can scale it down, and we're not judging it on how long it takes to spin up a new pod.
E
Maybe the thing to do is to just try it, and then we can collect some data as to how long it takes before, and then do it for a day or something, and on that day take a look and see what the average difference in time is, because a lot of those jobs queue for quite a long time. Yeah.
C
I mean, if you look at the average, then you're right, it's probably not going to make any difference. But we have these nodes running anyway; we have one per availability zone. It's not like we're spending money by having four pods in reserve, and once we move to queue groups, we're probably going to have more pods in reserve, because there'll be more jobs running. Okay.
E
We can do exactly the same thing, because all that does is go to a common dashboard library that builds it up according to a selector, a node selector that you give it. What we should do is add the exact same row there, so that when you're looking at the priority detail for, you know, whatever, project exports, you get that little panel that you can open up, and it's common to all of them; you get it in the queue detail too.
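To illustrate the idea of a common library building the same row from a node selector, here is a rough sketch; the function name, metric names, and label matchers are hypothetical, not the actual dashboard library's:

```python
# Hypothetical sketch of a shared helper that builds the same Kubernetes row
# for any dashboard, given only a pod/node selector. Metric names and labels
# are illustrative, not the real library's.
def kubernetes_row(selector):
    """Return panel definitions (title + PromQL query) for a given selector."""
    return [
        {
            "title": "Pod CPU",
            "query": f"sum(rate(container_cpu_usage_seconds_total{{{selector}}}[5m]))",
        },
        {
            "title": "Pod memory",
            "query": f"sum(container_memory_working_set_bytes{{{selector}}})",
        },
        {
            "title": "Pod restarts",
            "query": f"sum(increase(kube_pod_container_status_restarts_total{{{selector}}}[1h]))",
        },
    ]

# The same row can be dropped into the queue-detail or priority-detail
# dashboards just by passing a different selector:
print(kubernetes_row('pod=~"gitlab-sidekiq-export.*"'))
```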
C
I think we have to decide what the future of the Kubernetes dashboards is, right, because we're using the mixin and it's fairly opinionated about how things are laid out. We can either just take what it gives us and extend it, or we could revamp it. And we haven't really decided yet.
C
You know, it depends. CarrierWave still writes to the uploads temp directory, so there are cases where it creates files there, which is NFS mounted. But the temporary file that's created when Sidekiq goes to Gitaly and grabs the whole project, and then writes it to disk before uploading it to object storage, that is on local disk, on both the VMs and obviously in Kubernetes as well.
E
So this is actually a problem that we've had for a while, but it hasn't been that bad. You might have noticed that on some of the dashboards we have the sort of key metrics along the top, and on some of them there are two lines, so you might have two Apdexes. The reason for that is that the metrics may be split across two Prometheus servers, especially for some services.
E
Like, for web and pages, some of the metrics live on the default Prometheus, and then some of the metrics are on Prometheus-app, and we don't have a way of rolling those up to get a single view, and so we actually have these two metrics. But we only have that for... I mean.
E
So basically, what's happening is kind of strange, because it seems to be okay, but it's not. What's happening here is we've got a service called Patroni, and the way that we measure the latency of the Patroni service: Postgres doesn't give us an Apdex score itself, so we use the application and we look at how long simple queries are taking. We say:
E
Well, you know, we want 99% of SQL queries to take less than a second, and that's how we generate the Apdex score, and then we aggregate that up to a service level, and then we come up with this value over here, which is for the service.
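As a small sketch of the Apdex-style calculation described here, the fraction of SQL queries under a one-second threshold rolled up to a service level, with illustrative numbers only:

```python
# Sketch of the Apdex-style score described above: the fraction of queries
# completing under a threshold, then rolled up to a service-level value.
# Threshold and sample durations are illustrative.
THRESHOLD_SECONDS = 1.0

def apdex(durations, threshold=THRESHOLD_SECONDS):
    """Fraction of queries faster than the threshold (0.0 - 1.0)."""
    if not durations:
        return 1.0
    return sum(1 for d in durations if d < threshold) / len(durations)

# Per-component scores, weighted by query volume, rolled up to one
# service-level number (the value shown on the dashboard).
samples = {
    "web": [0.02, 0.15, 0.4, 1.3],
    "sidekiq": [0.8, 2.5, 0.1],
}
total = sum(len(v) for v in samples.values())
service_apdex = sum(apdex(v) * len(v) for v in samples.values()) / total
print(f"service apdex: {service_apdex:.3f}")
```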
But now we've taken some of our metrics, sorry, some of our Rails application, and we are running it in Kubernetes in order to service the project export jobs.
E
Those metrics aren't going to the same Prometheus server; they're going to, I forget its name, but I just call it Prometheus-k8s, it's the gprd k8s one, whatever it is. So it's going to a different server, and we have recording rules which take those and aggregate them to a service level. But effectively we've got a split brain, where Prometheus-app has its view of the world, and it's got a lot of traffic in it.
E
And it's saying: I'm aggregating all the data that I know about, in this case the Rails SQL timings, and this is what I know, and this is how good it is. But then you've also got the Kubernetes Prometheus, and it's aggregating a much, much smaller set of data, because it's basically just the SQL queries that come from project export jobs, and it's using that as its aggregation. And so we have alerts around that that are basically firing because of the split brain.
E
We have two aggregations, one over a small set of data and one over the medium set, and two things can happen. We get the split brain, where the data is not that good because we're only looking at project export instead of the entire Rails application. But the other thing is that there's an example of where we compare things for project exports.
E
We compare jobs getting enqueued to jobs getting executed, but now the job enqueue is happening on one Prometheus server and the job execution is happening on a different Prometheus server, and the two don't know about each other. So the alerts are firing because, in the universe of the alerts, the enqueue has been recorded in the one Prometheus server; it doesn't know that the jobs are getting executed, because that execution is being recorded in a different Prometheus server.
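A toy illustration of that split brain: if the enqueue counter lives in one Prometheus and the execution counter in another, the ratio computed inside either server looks broken even though the combined picture is healthy. The metric names and numbers below are made up.

```python
# Toy model of the split brain described above: each "server" only sees its
# own counters, so a ratio computed within one server is misleading even
# though the combined view is fine. All values are made up.
prometheus_app = {"jobs_enqueued_total": 1000, "jobs_executed_total": 0}
prometheus_k8s = {"jobs_enqueued_total": 0, "jobs_executed_total": 990}

def completion_ratio(server):
    """Executed/enqueued ratio as one Prometheus would compute it locally."""
    enqueued = server["jobs_enqueued_total"]
    return server["jobs_executed_total"] / enqueued if enqueued else 0.0

# From prometheus_app alone, nothing ever completes, so an alert fires:
print(completion_ratio(prometheus_app))   # 0.0 -> looks like a stuck queue

# The true picture needs both servers' data combined:
combined = {k: prometheus_app[k] + prometheus_k8s[k] for k in prometheus_app}
print(completion_ratio(combined))         # 0.99 -> actually healthy
```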
E
You aggregate your metrics to the service level, right? So you're saying: I want 99.9 percent of the requests coming into the service to be successful, or to have a latency of X, and if they don't, then I'm going to raise an alert. It's symptom based: the symptom is that the service is not functioning well. And that is all based on aggregating your metrics up to a service level, rather than doing it at a single metric value.