From YouTube: Scalability Team Demo - 2021-06-17
Description
No description was provided for this meeting.
A
And could you be the first one this morning?
B
Yes, I put a very nerdy topic on the agenda.
B
There was a discussion in our Slack channel last week where, in the context of error budgets, Bob pasted a PromQL formula and asked how to describe what it is calculating. He was probably trying to answer a question from somebody who wanted to understand error budgets, and I didn't understand it at all, and I started to get worried that we were doing something very fishy in our calculations. Then Bob created an issue, and then I thought: okay.
B
C
B
C
B
Well, Andrew, you're probably the one person who is most comfortable with this calculation, because I think you worked on it.
B
No, probably not, yeah. So the topic I put on the agenda is Riemann sums, and that's very, very nerdy, because Riemann is, I don't know, a 19th century German mathematician.
B
But the nice thing is that if you google "Riemann sum", then you get a Wikipedia page with nice pictures that help to understand what we're doing. So I think I have an excuse for being so nerdy and calling it the Riemann sum, because it guides you to a picture. But first I want to show what the question was about. So the formula was a sum over time of the one-hour failure rate over 28 days, this formula. Well, actually this one, in bigger letters.
B
So what does that mean? And the answer is a bit complicated. The answer is: it is the number of failures in the past 28 days, divided by 60, assuming that our recording rules create a rate metric every 60 seconds. If it was every five minutes, that would be 300 seconds, and it would be divided by 300.
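A sketch of that interpretation (assuming evenly spaced 60-second recording-rule samples; r(t_i) denotes the stored one-hour failure-rate values):

$$
\text{failures}_{28\mathrm{d}} \;\approx\; \sum_i r(t_i)\cdot 60\,\mathrm{s}
\quad\Longrightarrow\quad
\mathrm{sum\_over\_time}\big(\text{failure rate}\,[28\mathrm{d}]\big) \;=\; \sum_i r(t_i) \;\approx\; \frac{\text{failures}_{28\mathrm{d}}}{60}.
$$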
B
B
We want to see an increase over a long range, and the data we have are rates, and a rate is instantaneous, like how much things were increasing at a point in time, but it doesn't tell you immediately how much increase you had over a window of time. And what you need to do, if you want to go from a rate to the actual increase, is to take the area under the graph.
B
I don't know if this makes sense, but that's what integrals are in mathematics, just again very nerdy: an integral is taking a graph and computing the area under the graph.
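As a worked equation (a standard identity, not specific to our setup): the increase over a window [0, T] is the integral of the rate, and a Riemann sum over evenly spaced samples approximates it:

$$
\text{increase}(0,T) \;=\; \int_0^T r(t)\,dt \;\approx\; \sum_i r(t_i)\,\Delta t .
$$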
B
Now, what we're doing in our formula is: so this line would be the failure rate over one hour, and these dots are where we have values, and what I'm assuming is that the time between these dots is always the same. It's always one minute!
B
So if these are our failure rates, it would be the rate here times 60, for 60 seconds, and the rate here times 60, and the rate here times 60. Well, they do it on the other side, but let's ignore that for a moment. And you constantly do rate times 60 plus rate times 60 plus rate times 60, plus...
B
Does that make sense? Now, if you think of multiplication and addition, if you have a lot of pluses and every time you do times 60, then you can also first sum up the things and multiply by 60 once at the end; that is called distributivity.
B
So because we're assuming that all these things are evenly spaced, it's always times 60. So you can also say it's 60 times the sum of the rates, and that is then the area under the graph. So that's one part. And then why don't we write 60 times the sum over time of the failure rate? Because we take these increases, but we always divide by another increase. So the 60 is both above and below the dividing line; it's in the numerator and in the denominator, so it disappears.
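A minimal numeric sketch of that argument in Python, with hypothetical values and assuming evenly spaced 60-second samples:

```python
# Hypothetical per-second rate samples, one per minute, evenly spaced.
error_rates = [0.0, 0.25, 0.5, 0.0, 0.25]   # failures per second
total_rates = [4.0, 5.0, 4.5, 5.0, 4.75]    # requests per second
STEP = 60  # seconds between recording-rule samples

# Riemann sum: increase over the window = sum of (rate * step).
error_increase = sum(r * STEP for r in error_rates)
total_increase = sum(r * STEP for r in total_rates)

# Distributivity: the same thing as STEP * sum(rates).
assert error_increase == STEP * sum(error_rates)

# In the error-budget ratio the STEP factor cancels, which is why the
# recorded formula can drop the "* 60" entirely.
print(error_increase / total_increase)       # with the 60s
print(sum(error_rates) / sum(total_rates))   # without it: same value
```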
B
C
Can I just... this is so awesome, because this was all in my head when I was doing it and I didn't express it very well. But just on that point, yeah, one thing that made me feel confident about cancelling out the 60 on the top and the bottom, because that's how I thought about it, I was like, 60 on the top and 60 on the bottom, was that both of those configurations are in the same recording rule group, so they always run paired together.
C
B
And at the same time, so yeah. And I think, so one of the reasons, so the other thing I was curious about was: my first thought when I saw this expression was, why do we have a thing here that depends on the recording rates, the rate at which we generate data points? Like, why should the calculation depend on it?
C
It was, there was a time where I had a second expression, which was the number of samples, or the number of observations, that we had over that period, on the top and the bottom. But then I realized that that kind of messes things up. You know, we didn't need it. Firstly, we could cancel it out.
C
It made the query slower, and it just worked more elegantly if we just cancel it out on the top. So you can look at it, you can say over a 28 day period we saw, you know, whatever, 14,000, or however many samples, and therefore we can kind of infer from that that we're doing this once every 60 seconds; you know, we can go back from the number of samples.
C
It just worked out that we didn't need to do that. The other thing that's slightly different from a Riemann sum, as I pronounce it, is that there is overlap. So in that Wikipedia page that you showed, each of the bars is taking up a... so, do you want to share your screen that you had open a second ago? But, and yeah, so those bars are each kind of distinctly taking up a piece of the x-axis.
C
But actually, if you just go back to the ratio calculation that was up earlier, you'll see here that we are using a one hour rate over a... yes, yeah. And I wasn't sure if that was right, and I actually think I said in a call: luckily, we have someone who's very good at maths here in a demo call. But because I was...
C
Yeah, it was actually... but the thing about it is that, through experimentation... because I didn't go through the whole mathematical approach to proving it was correct, but I did lots of experimentation on it, on lots of different series, and it seems to work. This is also what we do with the upscaling on the Sidekiq metrics.
B
Yeah, and I actually didn't make progress on understanding this formula until I started ignoring the one hour. Because I think the one hour is important, because, well, the way I think of the one hour, and this is not really okay, mathematical me doesn't like it because it's too hand wavy, but for everybody else it's probably fine.
B
B
They would be much more jumpy, but we make the assumption implicitly that the rate stays more or less the same for 60 seconds, and if your rate is very jumpy, and yeah. So it's like, if something very local happened that was off and you get an incorrect rate, and then you multiply by 60 seconds, then you're multiplying that inaccuracy.
B
If you're lucky, it usually works out, but that's not the world we're in. And so, intuitively, I think these rates need to be smooth, and that's why one hour makes sense. But apart from that, they need to be smooth; I don't think they really change what the expression means, like, you can just think these are the rates.
D
B
Yeah, so you need to get points on the graph somehow, so that's... and you need some window, and one hour is a valid choice that, through experimentation, Andrew discovered works well. But if you want to try and work out what the one hour means for the approximation of the increase, then I think the math would get very complicated, and I just don't want to do it.
C
Yeah, and I was also worried about the jitter, because the evaluations are not happening exactly, you know, they're not back-to-back 60 seconds, so there's always going to be sort of... because Prometheus is kind of going in this loop, and it's got a schedule, and every 60 seconds-ish it will run that recording rule, but it's not exactly 60 seconds, right, depending on what the server is doing. And so there was also kind of some concern around that. There's...
C
Another reason why I think that this is actually quite an important discussion to have, and that is around the three-day rates, because I think we want the three-day rates, it's for...
C
Yeah, and especially for the low frequency Sidekiq jobs. And so we want the three day rates. We can't really practically evaluate the Sidekiq pods over three days, because we'll just melt Thanos and Sidekiq down into... yeah.
B
That's the other part of the story: if you want to know the increase and you had all the Prometheus data, you could just write the Prometheus query that takes 28 days worth of counter values and calculates an increase for you, but that melts Prometheus, and that's why we have these rates, which are lower in number, yeah.
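A rough back-of-envelope for why the raw query is so much heavier (illustrative figures only, assuming a 15-second scrape interval and per-pod raw series that the recording rule aggregates away):

$$
\underbrace{\tfrac{28\cdot 86400}{15}}_{\approx 1.6\times10^{5}\ \text{raw samples per series}} \times\, N_{\text{pods}}
\quad\text{vs.}\quad
\underbrace{\tfrac{28\cdot 86400}{60}}_{\approx 4\times10^{4}} \times\, 1\ \text{pre-aggregated series}.
$$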
C
Yeah, and actually in the days that we were on VMs we could pretty much get away with it. It would be a slow query, but now, with the number of pods, and the pods starting and stopping and living for just two minutes, we can't really; there's just too many series now, and it's actually got a lot worse. But so, with the three day one...
C
I think what we're going to do is, for everything, so not just for Sidekiq, but for the web and for API: all the three day ones will be constructed from lower sample rates, and then kind of... I think, in the code base, we call it upscaling. But this is where having you looking at this is super fortuitous right now, because what I found when I did it with Sidekiq is: it worked, you know, within...
C
I remember I put a graph on and I think Bob reviewed it, and maybe Craig Furman, and you could kind of see that it kind of tracked the real data. It wasn't perfect, but it was kind of a rounded version of the real data. But then for some, if I remember correctly, for some things, like for API or something, there were certain conditions where it didn't work very well at all, and I think what it was, again, this is kind of scratching at the back of my memory.
C
No, I think what it was was that when the failure rate... you know, for certain Sidekiq things, when there's no error, there's no observation. So, you know, if you don't have any 500s, then that series is absent; it's not zero. And for certain things that was a real big problem, and that's one of the reasons why I changed the recording rule to always give us a zero, and so I'll need to reevaluate that now and see if that's been fixed.
B
Right, I was actually wondering about this too, and it looked to me like, if you assume that you have a constant trickle of data points with one minute in between, and if some are missing and you take the sum, then I think you're effectively treating it as if the rate is zero. Because if the rate was zero, then they also contribute nothing to the sum, and if they're missing, they contribute nothing to the sum, yeah, so it's indistinguishable from rate zero, I would expect. There was, there...
C
...was, like, I was looking at the... because where this conversation really started off was with Ben, and Ben looking at the Prometheus servers melting down running these six-hour queries and saying, you know, why can't we just use average over time, and me saying, no, average over time really doesn't work, like, look, here's the observed data and here's what happens with average over time, and this doesn't match up. And then kind of figuring out the theory from the practice.
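A minimal Python sketch of one way the two estimates can diverge, using hypothetical numbers and assuming the error-rate series simply has no sample in minutes with zero errors (the absent-series issue mentioned earlier):

```python
# One-hour window, one sample expected per minute; minutes with zero errors
# produce no sample at all for the error-rate series (hypothetical values).
present_error_rates = [0.25, 0.5, 0.25]  # only 3 of the 60 expected samples exist
EXPECTED_SAMPLES = 60
STEP = 60  # seconds

# sum_over_time-style estimate: missing samples contribute 0,
# which matches "no errors happened in that minute".
increase_from_sum = sum(present_error_rates) * STEP                # ~60 failures

# avg_over_time-style estimate: averages only the samples that exist and
# extrapolates over the whole window, so gaps bias it high.
avg_rate = sum(present_error_rates) / len(present_error_rates)
increase_from_avg = avg_rate * EXPECTED_SAMPLES * STEP             # ~1200 failures

print(increase_from_sum, increase_from_avg)
```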
C
But then, with this approach, you know, I looked at it and I said, this is a better approach for kind of estimating what the six hour rate is, and for Sidekiq, where at the time it worked really well, but not for everything. And that's why in the metrics catalog we have a switch that says, I think it's called "upscale longer rate queries" or something like that.
C
It's upscale-longer-something-something, and it's only enabled for Sidekiq and for Postgres, because those are the two that we have. You know, for Postgres we're using all the Rails metrics and then putting them together, and I still need to figure out why we can't just apply that generally, because for some of them it was out by a few percent.
C
Yeah, because if we want to do the three-day one... because I don't think there's any metrics where we can just do three days using the raw data, it's just going to be too expensive, and so we have to fix this for the general case in order to use it for the three-day one. So it's something to think about.
B
Yeah, maybe we should just sit down and then you talk me through it, yeah, I'll...
B
C
B
Which is great, of course, yeah.
B
If I could try to summarize it: summing rates is okay as long as you know that the recording interval is a constant time in between, more or less constant, it's never going to be perfectly constant. And also, if you're dividing, if both the numerator and denominator all use the same recording rates, then it works out, because then the times 60 disappears everywhere.
C
B
Yeah, I also wanted to share this because I think all of us are sort of ambassadors for error budgets and these metrics. So I don't expect any... I hope nobody will mention Riemann sums outside of this call, because it would probably alienate the next person, but I wanted to share this to have a more intuitive understanding of what these numbers are, to somehow boost that for everyone. So thanks.
B
B
C
B
I also remember that you were talking about figuring out how to get the calculation right when you're mixing errors over time, like error increase and latency miss increase, and you have these different kinds of data, and you basically need to do a weighted sum. You can do that with...
B
Or you don't know the weights, and the moment you work with increases, you effectively get the weighted sum, but you don't have to figure out what the number of samples was, because who cares what the number of samples was? If you can calculate them, you...
C
...know that they're paired, yeah, yeah. But it actually may be something we should document somewhere, because there's probably some point in the future where people say, oh well, maybe we don't need these two things to be evaluated together, and...
B
That was my starting point: Bob showed me this formula, and I thought, why on earth are we not doing average over time, and do we have a formula that makes assumptions about a Prometheus config or a Thanos config? Like, yeah, can we do it without those assumptions? And so, no, we need the assumptions, but yeah, we should document them, yeah.
D
D
Which ones were there? Sorry, not prepared, yeah. Initializing everything for every Sidekiq job that could run on a cluster, rather than waiting until they run, because of these low rate Sidekiq jobs. And basically it goes from about 10,000 metrics at the moment to about 40,000, which is not really good, but it's also not out of line with the others.
C
Is it all around... it's not... there's some fairly well publicized documentation on sizing, you know, the number of series in a Prometheus; it's in the Prometheus docs. But, you know, the kind of simplest approach would be to look... I suspect that prometheus-db is probably under more duress, and so to look at where we are, you know, just relatively, look at the default, db and app instances, and see if it's...
D
C
D
D
VMs... how many Sidekiqs are still on VMs? Still seven of them, until we get GKE for that shard, just the last of the catch-all, which we desperately wanted to turn down. So we could ignore those actually, and they're not the problem, because there's only seven of them, so you're right, it's GKE, isn't it? Is there a metric for the number of metrics, even?
C
Yes, it's actually under... if you go to Prometheus GKE, I'll just share my screen quickly.
C
So, Prometheus GKE, and here we have, big surprise, almost all of our top 10 are Sidekiq things. So they have, under the status thing over here, they have a bunch of things: top 10 label names with high memory usage... path... that's not what I...
D
C
The number of series: what's that, two million? Okay, so it's a lot, it's actually a lot higher. And then just one more, just for a bit of... we could probably throw...
C
It's at five million, GKE's at five, and that's almost certainly because of the pod churn, you know, because of all the pods creating new series when they go in and out of existence. So, wow, okay, so my guesstimates were totally off. So prometheus-db is only at half a million, so it's cruising; prometheus-app is at 2 million.
B
C
B
Yeah, but just because they could have happened doesn't mean that we know they're not happening, because otherwise we wouldn't have these missing metrics. And it also doesn't mean that if we start pre-populating, this doesn't tip over the Prometheus server. But like you said, thirty thousand on five million, it will be very odd if that pushes it over the edge. Do we have saturation to tell us if that one with five million is in trouble?
C
No, we should, and I'll just quickly do it now. I looked at the default as well, and that's at 16 million... 1.6... oh, one point... my eyes... thank you. I think so, yeah, yeah.
C
C
B
Where do we feel the pain? Is it in Prometheus itself, or when we try to run queries across a lot of them?
C
I think there's memory, because it's got to keep certain things in memory as well, you know, the indexes to where other things are. I'm not actually totally sure about it.
C
B
C
C
So Craig and I spoke about exactly that topic in our one-to-one earlier. Do you want to expand on it a little bit, Jakub?
B
Oh, so it might be a natural way to divide up the Prometheuses, to have one... if there's this natural division there anyway.
C
So just to expand on the idea for others a little: you know, what we're going to have is a Redis, we're going to have a Sidekiq cluster per zone or per Kubernetes cluster, so that we can kind of have lots, and then each one will have its own Redis instance, so that we can effectively horizontally scale the Redis behind Sidekiq. But then also, what we could potentially do is have a Prometheus instance that just lives alongside, that's paired with that cluster and collects its metrics as part of that cluster.
C
B
Yeah, I think one of the reasons to do this might be that we just saw that the GKE Prometheus has five million series, and we said that's probably because of pod churn. So, like, the more of these... if it's mainly because of the pods, it might be nice to also just... if that is a source of churn and we can separate them, that might be... yeah, and...
B
Well, because we use Thanos, it's perfectly fine to have multiple Prometheuses, and this is why I say that zonal clusters are not my favorite topic, because I am a bit nervous about the application impact of having multiple Sidekiqs, of having Sidekiq talk through different Redises but the same Postgres, but that's a different topic. But I think in the case of Prometheus, I don't see a reason why it wouldn't work, because we already do this sort of stuff, and it's mainly about, operationally, is this something... so this is hard.
C
We still do it at a single Prometheus level, so, you know, we would get split-brain SLI alerts, and that's not a problem, because all of the, well, almost all, except for the seven that Craig mentioned, all of the jobs are being evaluated within one cluster, which is the regional Sidekiq cluster, and they're all in there, so they're all getting evaluated in one place.
C
So it's not a problem if we split it up like that, you know, jobs could run in one of three clusters, and we just have to do exactly like what we did with the rest of our SLIs and move them so that they're evaluated in Thanos at a higher level. And the reason I say it's low risk is because we've got all the tooling in place for that; you know, we've got aggregation sets. We could probably just set up another aggregation set for that exact... right, but...
C
E
So, I'm sorry, I cannot turn on the camera right now. For my understanding: isn't the problem with having Sidekiq on zonal clusters the fact that we take out a cluster at a time when we are upgrading? Which could mean, right, like, if we have a separate Redis per zone, that would possibly cause additional problems, like we would have to have a master or main Redis that would be able to compensate for the fact that we are taking out two zones at a time.
E
C
I mean, I wasn't aware of that argument, and why would we take out the entire cluster while we're upgrading it?
E
First of all, there was a challenge with node pool allocation, if I remember correctly. We used to do it in one go, and then after we had that, we realized it's actually a nice feature to be able to see any breakage prior to rolling out to the rest of the cluster, so we always have some capacity.
E
That's all up for re-discussion, right, it's not... we were purposeful there, but it's something to keep in mind when talking about these topics.
C
E
If we are talking about... look, if I understand correctly what you were talking about here, like, you were saying that we would have Redis set up per zonal cluster. My understanding is that if we take out the whole, you know, zonal cluster, whatever was stored in the Redis instance that was for that specific zonal cluster could cause issues elsewhere, right? Okay.
C
C
Being unavailable, right, yeah, but that shouldn't affect Redis too, I mean...
E
E
So yeah, the challenge here I'm talking about is the different versions that we would run, you know, depending on the cluster we are rolling the new versions out to, right? Like, if I remember correctly, one of the problems was that at any given point in time, one of these clusters could be running a different version of, well, GitLab, basically. So wouldn't that already cause...
E
Right, but if you have a separate Redis cluster, sorry, a separate Redis per zone, that would mean that the new version, or that there would be a version stored differently in Redis, so you might actually end up having a situation where Sidekiq pulls from Redis something that is in a different format.
B
E
B
Yeah, yeah, like a deploy, a deploy is slow. So we have, that's true, jobs submitted by old versions picked up by new versions, and the other way around, and it's a horrible mess. But we already...
C
B
C
Yeah, I'm just kind of curious: this sounds like something that people are talking about. Is there an issue, or is this being written down somewhere? Because it...
E
C
Because I would push back on that, I mean, you know, I don't necessarily agree with that, to be frank. You know, there's... yeah, I'll respond on the issue, I'll try and find it, or I'll ask Jarv about it, because I'm...
C
I don't think that... that doesn't connect for me at the moment, at least.
B
I think I see a general problem here, where there is a solution, which is this Redis per cluster, so Redis per Kubernetes cluster for Sidekiq, and different problems that people think this solution solves.
B
And then you get a disconnect, where I think on the delivery side, from what I've picked up from Jarv, there's this idea that within a zone you could have uniformity, like, you could drain an entire zone, upgrade GitLab and put traffic back in, and then you no longer have this mess of different versions of GitLab talking to the same Redis, which is a different problem from scaling, from having to process lots of Sidekiq jobs, which I think is...
C
B
So I suspect this sort of confusion is going on here, and I think that whenever we talk about this, we should also be clear about what problem we're trying to solve and not get attached to a solution. Like, the one thing that always makes me a little anxious is if I see people getting very attached to a certain solution because they think it will solve their problem, and in the end you end up with something that doesn't solve any of the problems correctly, or something complicated.
E
This sounds like a perfect opportunity to actually pair with delivery on a larger discussion. So maybe it's worth, first of all, starting up a general issue to talk about the problems here, and then pairing with delivery to see whether we need to do some architectural changes to how we deploy GitLab, and what else is necessary to add to our infrastructure.
C
Yeah, I mean, I can see the advantages also of, like, totally shutting down traffic to a cluster, but there's also disadvantages to it as well, right? And so it makes me wonder whether making that decision will make other things kind of more difficult in future, and also whether that means that we'll miss problems that customers might have, because customers aren't running, you know, multiple Kubernetes clusters.
E
C
You know, they don't have that luxury, right?
E
But at this point we also have to take care of what gitlab.com needs, right? Like, if we can't scale further, then there is not much to discuss there, right? So yeah, yeah.
C
E
E
E
A
We should give them, like... after the release is out, I think. They've got quite a lot going on, and if we do this before the release, I don't know if we'll have all of their focus on this particular topic. So maybe we give it a week or two before we put it on their agenda.
E
E
Drawing a blank; you should ping Jarv for that one, I remember.
E
No, that's the namespace one. I don't think so.
A
Well, in the interest of time, I see that Marin's got another item on the agenda, so shall we hop to that one?
E
Yeah, if you don't mind. This is more of a question, now that I see Craig is around. We see the recurrence of the cron worker not being able to archive the trace jobs again, so we are in danger of data loss, so to speak, again.
E
Craig, you mentioned in that comment that this is a pure infrastructure problem, but I want to talk about, you know, whether that's actually the case. Yes, I know the fact that we have a large Sidekiq pod churn, and this is becoming a theme of this call a bit, but at the same time I just kind of want to discuss whether, you know, the application is not able to do what it's supposed to do in this new environment; maybe the architecture needs to change as well.
E
D
Cool, so the very simplest, quickest summary I can give you: trace chunks get put into Redis, up to 128 KB each; those get moved into object storage; and then, when the job finishes, those objects get collated into a final artifact, which then gets put back into object storage.
D
D
Yeah, not that things went really, really horribly wrong in the first place... the problem that we have... so the app could be re-architected, I mean, if we could do more of those in parallel. I mean, the slowness with the trace, with downloading the traces from object storage, is that they're done serially. You know, we do one after the other after the other, and they take, you know, some number of milliseconds each.
D
What is it, probably, yeah, 50 to 100, and yeah, it's just that they're happening serially. It's not that object storage can't handle the throughput; it's that we're doing it in a single thread. So if we could bring those down multiple at once, we could reduce that by factors; you know, two or three times would probably be enough. You know, if we get those down to two or three minutes, we'd get a much better chance of those working.
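A minimal Python sketch of that idea (not the actual GitLab code; the helper name and return value are hypothetical), fetching chunks concurrently instead of one after the other:

```python
from concurrent.futures import ThreadPoolExecutor

def download_chunk(chunk_id: int) -> bytes:
    # Placeholder for the real object-storage GET; in practice each call
    # spends roughly 50-100 ms waiting on the network.
    return b""

def collate_trace(chunk_ids, workers=3):
    # Fetching two or three chunks at a time cuts the wall-clock time of an
    # I/O-bound loop by roughly that factor compared to a single thread.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        chunks = pool.map(download_chunk, chunk_ids)  # results keep input order
    return b"".join(chunks)

print(len(collate_trace(range(10))))
```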
D
It's
not
wrong.
The
only
other
way
would
be
to
pre-collate
them
in
smaller
chunks,
but
then
you're
sort
of
you're
just
throwing
data
around
over
and
over
again,
and
I'm
not
sure
that
would
be
you'd
just
be
fighting.
You
know
putting
together
10
of
them
and
then
putting
them
back
into
object,
storage
and
then
at
the
end,
putting
chunks.
That
seems
a
bit
weird
to
me.
E
So
I'd
like
to
take
like
a
page
out
of
andrew's
book
in
this
case
and
and
maybe
like
discuss
whether
10
minute
job
in
in
sidekick
is
something
that
we
want
our
application
to
do,
and
maybe
we
can
set
a
goal
of
what
we
can
offer
as
infrastructure
and
ask
development
to
tell
us
what
they
can
do
within
that
period
of
time
and
push
back
if
that's
not
doable.
E
So
you
know
like
it
feels
to
me,
like
the
long
running
or
stable
chart
suggestion
that
you're
making.
It
goes
completely
against
the
the
kubernetes
philosophy
there
right
so
like.
I
would
like
to
challenge
the
situation
a
bit
if
possible
and
if
only,
if
not
really
possible,
without
a
major
re-architect
we
go
back
to
yeah.
I
don't
know,
I
don't
want
to
go
back
to
vms,
but
the
vm
behavior.
D
Yeah,
I
I
just
said
I
mean
I
agree
with
not
with
having
to
be
more
resilient,
but
10
minutes
feels
like
a
very
slow
small
number
for
a
pod
to
live
when
none
of
them
seem
to
live
more
than
10
minutes.
I
mean
we're
just
we're
just
never
getting
it's
not
like.
Some
of
them
die
quickly.
It's
none
of
them
seem
to
or
all
these
jobs
are
ending
up
on
pods
that
never.
G
F
D
C
C
Yeah, and the problem, I mean, I agree with Marin, like, we should start pushing towards saying, you know, you can't have it... we can't scale things where, if somebody does one push, we go off and do an hour's worth of compute on that. That's just not gonna work, that doesn't scale. But at the same time, a lot of these things are what they are, and so we're gonna have to have a lot of exceptions to that rule, because they're not going to be able to change them super quickly.
A
As well, because we've had the problem in the past; I've done manual intervention on this for months now, we've been doing this manual intervention, and now we thought we'd gotten somewhere, and we haven't, we're back to where we started. So this seems like a decent one to say, you know, you have to have a better long-term plan for this; in the short term we can do X, but we can only support that for a very short period of time.
A
B
Right, but we started this with the CI trace chunk drops, right? I had technical trouble, so I missed part of the explanation of what's going on, but is it a throughput problem? Like, why does it even matter that these things get terminated after 10 minutes? Do they get scheduled once per hour, and they expect to run for an hour, and they only get 10 minutes?
B
Or, which we could do, is a single job, processing the artifacts of a single job; is that taking more than 10 minutes and it's never completing, some...
D
B
E
E
Related to the project export/imports, I just want to share how that whole thing is going on that side. So we have, like you said, Craig, the same problem, basically, right now. The challenge there is that the feature itself is outdated architecturally, right, like, it needs a complete rework, because it was already not reliable on VMs.
E
So we are right now looking to create that stable shard that you're talking about, with a couple of pods that will be running there almost exclusively just for project imports. But we already know that's not going to work; we already know this is going to fail as soon as you have more than four or five parallel long-running migrations, you start queuing, and it just goes out of hand, right, and we don't have infinite capacity. So what is being done on the side... sorry, I know, just to finish this out.
E
What's being done on the side is: we are buying time for the team that is responsible for project import/export to actually redo the feature. So this way we can support them for a tiny period of time while they get ahead, but everything else is stopped, right, like, we're not taking new project imports, like, they're failing, it's known, and so on. So it's a bit of a mess. I don't want us to get there with this situation; so that's why I would like to start these discussions in parallel.
E
Like Rachel mentioned, right, we need the longer term plan and the stop gap; the issue that you're asking approval for can only last for a bit, like, it can't go on forever. You know, sorry.
C
On top of the operational and administrative overhead of running on Kubernetes... and, you know, the more we allow this, the more pain we're kind of kicking down the road, because, you know, there's no scaling on these things, there's, you know, everything we know that's wrong with it; we're basically just using Kubernetes as a deployment mechanism then.
A
E
Well, for this specific one, because it has the visibility in the stand up, I can make it into... not necessarily a rapid action; I don't think we need a rapid action for this, but we need an organized effort from multiple sides to get this thing done. So what I could do is, if you have a bit of a write-up of, like, what do we need to expect from development, what are we going to do as a stop gap, and so on...
E
I can go with that and explain what kind of effort we need, and maybe it doesn't turn, hopefully doesn't turn, into a rapid action, but more of a larger project that we need to collaborate with others on. And it could be... this is a scaling problem, right, so this is what we need to do as well, so we can pair with the team that is responsible for this to design a new... yeah, redesign this feature, basically.
C
E
That makes sense. One of the things I forgot about is engineering allocation; I still don't know how to use it. I think that's actually a really good approach there. So, theoretically, we as a team could take that on, together with whoever the engineers allocated there are, and drive it as one of the projects.
E
H
Sorry, Rachel, I didn't...
E
...know this was part of infradev, the engineering allocation. I only got...
A
...into the engineering side this week, yeah, it's still new to me; I'm also learning how to make it useful. But so, what I'll do is: Craig's last comment on this one issue is about it sitting firmly with infrastructure, so I'm going to draft a reply to that, but I will, yeah, I'll draft it and then send it back to Craig, just to check that my statements are right, and then we'll try to push it forward as a project using that engineering allocation.
B
E
If we can stop the recording, I can say the real thing. Otherwise: it's a way to actually get some architectural changes done outside of the regular scheduling that product does, so outside of the product process.
B
C
No, my understanding is that the engineering department has got an allocation that they can use for longer term strategic initiatives to fix things. So in my mind, infradev is quite tactical, and it's quite like, we need to fix this thing now and we're gonna scream a lot until it gets fixed. Engineering allocation is the next rung up, where there is a certain proportion of headcount
C
that's dedicated to addressing kind of strategic problems. Like, off the top of my head, I would imagine, you know, the object storage problems around file upload and all of that would be a great engineering allocation problem. It's not something that you can address really short term, like, you know, in the next three releases, and it needs devoted kind of engineering, and Christopher's managed to get an allocation of headcount that can work on those projects, and it moves around between teams.
C
F
It sits in between infradev and rapid actions, like, it's faster than infradev, but not as...
C
Not as fast as a rapid action? No, it's slower than infradev, it's slower than infradev and slower than rapid action. So it's the third tier.
E
It's a different optimization; it's an optimization to skip the whole prioritization process that goes across multiple milestones, where you need to go through a product manager continuously. This is more of: you have a dedicated set of people who can actually do the work and schedule the work themselves, to skip a couple of levels, right, yeah, and...
C
C
And the kind of status updates of that is now in the Tuesday infradev meeting, so infradev and engineering allocation happen together in that Tuesday evening meeting. And so in the first part of the meeting we talk about infradev issues, and then we talk about engineering allocation, and what you obviously find is there's a lot of overlap. So some of the stuff that's been raised as infradev, it's like, oh well...
C
Actually, that's going to be this engineering allocation that's underway. Or the other thing that might happen is, you know, we're seeing this common pattern over and over, you know, like, I'm going to use the file upload example again: we have all this different pain because of CarrierWave, but it's not something that, you know, one team is going to go...
E
So I have an item for tomorrow for the SaaS stand-up. So if anyone can let me know whether we'll have a tiny bit written up so I can actually share; otherwise I'll move the update to Monday and try to...
E
Yeah, and Craig, also, I really appreciate the write-up that you did in that issue, right, like, it was really super clear what is happening there, and I do appreciate your helpfulness in saying that this is an infrastructure problem. But as someone who's been in infrastructure for a while, I realize it's very rarely only an infrastructure problem.