From YouTube: 2023-02-02 Scalability Team Demo
B
Yes, so this came up in kind of a fireside chat that Stephanie, Matt and I had a couple of days ago, where Matt and I had a pretty similar idea of how we thought MultiStore works for migrating data, and Stephanie refuted both of us: it doesn't work the way we thought it works.
B
So
the
the
mental
model
that
I
had
was
that
we
would
have
kind
of
a
four-phase
rollout
where
we
we
first
write
to
the
old
data,
store
and
read
from
the
old
data
store.
Then
we
enable
dual
rights.
So
we
read
from
old.
We
write
to
both
we
kind
of
leave
that
running
and
warm
things
up.
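A minimal sketch of that assumed four-phase model (names are illustrative, not GitLab's actual MultiStore API):

```python
# Hypothetical four-phase rollout: 1) read old, write old; 2) read old,
# write both (dual writes); 3) read new, write both; 4) read new, write new.
class PhasedStore:
    def __init__(self, old: dict, new: dict, phase: int = 1):
        self.old, self.new, self.phase = old, new, phase

    def write(self, key, value):
        if self.phase >= 2:       # dual writes begin in phase 2
            self.new[key] = value
        if self.phase <= 3:       # stop writing old only at the final cutover
            self.old[key] = value

    def read(self, key):
        store = self.old if self.phase <= 2 else self.new
        return store.get(key)
```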
B
And so you have no way of enabling dual writes without reading directly from the new data store, and you also can't independently say where the reads come from: the reads will always go to new and then fall back to old, which is pretty surprising behavior, or at least it was to me. This issue talks through some of the consequences of this design. I think for the original, initial migration that made sense and was fine, but as we're getting more latency-sensitive and also more consistency-sensitive workloads, it may be time to rethink the design, and there are a few ideas outlined in the issue.
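A sketch of that surprising read path as described (illustrative only, not the actual implementation):

```python
# With dual writes enabled, reads always try the new store first and
# fall back to the old store on a miss; there is no way to pin reads
# to the old store independently.
class FallbackStore:
    def __init__(self, old: dict, new: dict):
        self.old, self.new = old, new

    def write(self, key, value):
        self.new[key] = value
        self.old[key] = value

    def read(self, key):
        value = self.new.get(key)          # reads always hit new first
        if value is None:
            value = self.old.get(key)      # then fall back to old
        return value
```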
B
It might have been rate limiting, or it might have been... what was the other one, I think? Sessions.
B
That's part of it: so a general warm-up period and, I guess, generally speaking, a backfill of all the data if we need to do that, right? So if we do a proactive migration of data, which for, say, persistent or shared state we'll likely want to do, it would be something like that.
B
But I think the other piece in gaining confidence is to actually do some analysis on the data that was migrated. So you can sort of leave it running, and then you can snapshot both the old and the new and compare the data and ask: is this a strict subset?
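For instance, the comparison could be as simple as checking that everything in the old snapshot appears unchanged in the new one (an illustrative sketch, not the actual tooling):

```python
# Check that the old store's snapshot is contained in the new store's
# snapshot: every migrated key exists in new with the same value.
def old_is_subset_of_new(old_snapshot: dict, new_snapshot: dict) -> bool:
    return all(
        key in new_snapshot and new_snapshot[key] == value
        for key, value in old_snapshot.items()
    )

assert old_is_subset_of_new({"a": 1}, {"a": 1, "b": 2})
assert not old_is_subset_of_new({"a": 1}, {"a": 2, "b": 2})
```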
A
Too bad he's not here, because he was mentioning moving MultiStore into something that is configuration. So when we're working on this anyway, I think having the option to leave the reads where they are while having dual writes is something we need to consider in that design.
B
Yeah, so if any of you have any more thoughts on this, please do leave them in the issue.
A
Now that you mention it like that, I'm kind of surprised that wasn't the approach we went with. I was in the initial discussions, and then when I reviewed the merge request for the MultiStore, before it was ever used, I thought, yeah, that makes sense. But how you mentioned it... otherwise, I think, oops.
A
Yeah, so recently, I think Igor brought me into a discussion with the database team. They were looking into saturation points for integers running out, on tables that have an int rather than a bigint as their primary key, and they were already working down that list.
A
But they were surprised that we had notified them that they were across the threshold that they had in mind, because that's a saturation point that we do monitor in Tamland. And out of that, I restarted the work on using the soft SLO, the soft threshold, for capacity planning, and leaving the hard threshold for monitoring only. Let me show you where I got.
A
Yeah, 31 new capacity planning issues, but I think that's just a trigger for us to set the thresholds correctly. Currently we've just used the hard SLO, which means we would create a capacity planning issue as soon as we think we might alert the SRE on-call at some point in the future. But for a lot of saturation points we want to start work before then; specifically, the integer overflow one is months of work.
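As a sketch of the decoupling being discussed, paging could key off the hard threshold while capacity planning projects against the soft one (field names and numbers here are illustrative, not Tamland's actual configuration):

```python
from dataclasses import dataclass

@dataclass
class SaturationPoint:
    current: float           # current saturation, 0.0 to 1.0
    growth_per_day: float    # simple linear growth estimate
    soft_slo: float = 0.80   # capacity planning threshold
    hard_slo: float = 0.95   # alerting threshold (pages the SRE on-call)

    def should_page(self) -> bool:
        return self.current >= self.hard_slo

    def days_until(self, threshold: float) -> float:
        if self.growth_per_day <= 0:
            return float("inf")
        return max(0.0, (threshold - self.current) / self.growth_per_day)

    def needs_capacity_issue(self, horizon_days: float = 90.0) -> bool:
        # open an issue when the *soft* SLO is forecast within the horizon,
        # leaving the hard SLO purely for paging
        return self.days_until(self.soft_slo) <= horizon_days
```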
A
The problem is that the thresholds are coupled. We have the alerting threshold that is going to page an SRE on-call, and they need to solve a problem right now, especially for this one: if we saturate here, it is going to be loads of hurt, and I don't know what the easy solution is. So the main thing that I want to fix is to decouple that from the threshold that we set for, "hey, team..."
D
Yeah, I mean, those names, you know, you can blame me for them, but they never really made a lot of sense, like "hard" and "soft". Well, on the one hand it makes sense to use both, and I think it's a perfectly good change, but maybe we should just rename them to "alerting threshold" and "capacity planning threshold", because then it's one less thing that people have to understand.
B
The current behavior, before we change this to soft, is that it projects whether we would reach the alerting threshold in three months, right? Yes. Is that correct? Yes.
A
I think we need to go through some of those that have been set, because there are 31 issues, so 31 saturation points, maybe fewer, because there could be issues separated by service. So we just need to go through them and see. I wouldn't change them all, but just the ones where the change would create an issue; I don't think we should blindly set them all to match, in my opinion.
F
We'll share it and see what happens, because...
A
Some recordings that we use to display on dashboards and so on, so global recordings that happen in Thanos Ruler, don't match their actual source metrics, and this is an example. So here I'm not using the error budget metrics, because I was just looking at something different and this is what popped up.
A
So the green line here is the component ops rate recorded in the separate Prometheuses, and the global one is what we actually end up displaying in dashboards, and this is also the kind of recording that we use for error budgets. So the problem happens there as well, and you can see these huge drops that don't match what the Prometheuses show.
A
That's the theory. Yeah, I'm...
D
I'm sure it must be that, right? Like, it would be surprising if it's not that. Do you get logs for that?
D
Do you see... does the log not have the recording rule name in it, that you could kind of filter it down by, perhaps?
A
And there's one that I... go ahead, Andrew, go first.
D
Just before we go on, because I think it's a kind of important point: if you go read the Thanos documentation, it very much says only use the partial response strategy if you know exactly what you're doing, and, like, you probably don't want to do this. There's a lot of that kind of thing in the documentation, which is right. And when we were doing this, it was kind of like: our choices are either error or warn.
D
And if we had error, then... whereas this way, at least, if one Prometheus drops out, everything else still gets evaluated at the level at which it's at; the broken one just gets excluded. And it's better for us to still do the SLO evaluation than to not evaluate at all during that period. So it was kind of the least evil option available to us.
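A toy illustration of that trade-off (the strategy names follow Thanos's warn/abort partial-response idea; the data and function are made up):

```python
# With a lenient ("warn") partial response, the aggregation keeps going
# using whichever stores responded, so one broken Prometheus shows up as
# a dip in the global series. With a strict ("abort") response, the same
# failure means no result at all for that evaluation.
def aggregate(per_store_rates: list[float | None], strategy: str):
    missing = any(r is None for r in per_store_rates)
    if missing and strategy == "abort":
        return None                                   # whole query fails
    return sum(r for r in per_store_rates if r is not None)

rates = [120.0, 80.0, None]        # one backend malfunctioning
print(aggregate(rates, "warn"))    # 200.0: a sudden drop in the total
print(aggregate(rates, "abort"))   # None: no SLO evaluation this cycle
```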
D
So the thing is, the problem is that if we had, like, one Prometheus server malfunctioning, it doesn't even need to have any data on it, right? It just has to be malfunctioning; it doesn't even have to have this particular data on it. If one host out of hundreds (well, not hundreds, maybe many dozens) is out, then we don't get any metrics.
D
We should be investigating that. We should be making ourselves more immune to it, but we should also be investigating, because, you know, if it was a pre server, you wouldn't see much change, right? A pre server or a staging server would be a tiny little dip in the metric; it doesn't matter. But that's a big chunk of data that's falling out, and that means it's one of our proper production servers that's now becoming invisible for a five-minute period. So we should definitely investigate that as a separate issue.
A
Yeah, one other thing. This is not directly related, but it's a little bit about how my understanding of recording rules works: everything in the group gets evaluated at once, and we've separated the feature category metrics from the component ops ones, but both use, in some cases, the same SLI aggregation rules. You probably know what I'm talking about, I don't know if the others do, but those get recorded in a separate group. Would that also cause incorrectness?
D
Sorry, I was thinking... I don't think the group does more than dictate the pace at which they run, because they're always running in sequence inside a group, and then it goes back to the beginning and starts again. But I don't think, if it fails halfway through the group, that it short-circuits the rest of the rules. Or is that what you're talking about?
D
Yeah, the evaluation times, definitely. And obviously, say you've got it every 30 seconds: it's just a loop, right, basically a while loop. If it's running every five minutes and an evaluation takes three minutes, it'll run for three minutes, wait for two, and then start again. But if it runs for six minutes, then it will just run over and over and over, because that's just how the loop works.
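In pseudocode, that loop is roughly (a sketch of the described behavior, not Prometheus's actual source):

```python
import time

def rule_group_loop(evaluate, interval_seconds: float):
    while True:
        started = time.monotonic()
        evaluate()                        # run the group's rules, in order
        elapsed = time.monotonic() - started
        if elapsed < interval_seconds:
            time.sleep(interval_seconds - elapsed)  # e.g. ran 3m, wait 2m
        # if elapsed >= interval: start the next evaluation immediately,
        # so a slow group just runs back to back, over and over
```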
D
If you... but I don't think there's any atomicity to those groups. Okay, yeah.
A
Rules... let me show it in code, perhaps.
D
Yeah, the reason that's very specifically in the same group is so that you don't use stale data, because the order in which they're evaluated is always top to bottom. If they were on two separate loops, those could be kind of off-kilter from one another, where one is being evaluated immediately after the thing that uses it, and then you're always getting data that's one minute further out of date. So having them in that order kind of prevents that, yes.
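A tiny model of that ordering guarantee (illustrative rule names and values):

```python
# Within one group, rules run top to bottom each cycle, so a dependent
# rule reads the value its producer just recorded. If producer and
# consumer lived in separate groups, the consumer could fire right
# before the producer and always read data one interval stale.
def evaluate_group(rules, metrics: dict):
    for name, expr in rules:              # strict top-to-bottom order
        metrics[name] = expr(metrics)

metrics = {"raw:requests": 100.0}
group = [
    ("sli:rate", lambda m: m["raw:requests"] / 60.0),  # producer first
    ("slo:ratio", lambda m: m["sli:rate"] * 0.95),     # consumer sees fresh value
]
evaluate_group(group, metrics)
print(metrics["slo:ratio"])
```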
D
Actually, yeah, they're not in the same group, so those ones could be staggered, like, actually delayed, which is not necessarily good. Obviously, the longer the period over which you're expecting to use them, the less of a problem that is: if you're using it to evaluate over a 30-day period, then a one-minute delay on your metrics doesn't make any difference. But if you're using it to alert on immediate issues, then it can be a problem.
A
I'm going to... because it worked before, and now we're looking into it and I noticed this, so I'm going to postpone that for a while. Another question that I had here: we have these aggregations in Prometheus, and they have a type label in their aggregation. This means, if you specify a type label in the aggregation and no selector...
D
Yeah, yeah, so there you've got the same... I mean, the only other thing you can do is maybe double the speed at which you evaluate the intermediate ones and kind of, you know, hope for the best.
D
Yeah, because, I mean, you can turn them off; I think there's a flag. So if you're feeling adventurous, you could turn it off briefly and see how... I mean, I think the first thing we need to do is make sure that Thanos, and Thanos Ruler in particular, is firing on all cylinders and not like my car, and then, you know, maybe we also see if we could scale Ruler up horizontally.
D
But, I mean, you know, if you look at the documentation, they say you can scale these things horizontally. So have we just got one because it was the default, and is that something we should be thinking about? Then we get rid of all this complexity around these intermediate recording rules and the jaggies that they give us and all of that stuff, and instead of having one Thanos Ruler, we have 100 Thanos Rulers and we don't even have a problem anymore.
B
I was going to say, we have a similar issue with Thanos Compact as well, where we kind of...
B
...want to scale that out, if possible. So if we're going to think about how to structure one of the sort of monolithic Thanos components, we can maybe think about it in a bit of a broader way as well.
B
Okay, so the next item is something that Jacob brought up, which is related to how we manage our Redis, and in particular our Sentinel configs, on the VMs.
B
We try very hard to coordinate these reconfigures, and so we sort of get to repopulate the state. But there are certain edge cases where, if we were to nuke it on multiple Sentinels at the same time, then we would get into a really bad state, and Jacob built a test case for this, which I am going to share now.
B
So this test case has two Redises and three Sentinels, and it starts the Sentinels off with an empty config, where the master in the sentinel monitor line is mismatched with who is actually the master, that is, with which Redis does not have a replicaof line in its config. So I'm going to run this, and it's going to take a few seconds to start up.
B
So yeah, Sentinels take a bit to converge, but now we've got... the Sentinels don't detect the current state and say "we have one master and one replica, this is fine." Instead, they say: "oh well, we have a master, but it's not the one I thought it should be; that one is actually a replica."
B
"Let me perform a failover." But it can't do a failover, because it doesn't know about the replicas. Part of the state that Sentinel stores is known Sentinels and known replicas, and in this case my suspicion is that because it doesn't have any known replicas, it can't do the failover.
B
So what I changed in this test case is to actually write those known-replica and known-sentinel lines to the config file, and basically that's the proposed change to how we manage our Sentinel configs. So we can see how that changes the behavior in this specific case, and hopefully it's going to go a little better and not leave us in a state where the cluster is broken, because this situation is bad.
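As a sketch of that proposed change (the addresses and run IDs here are made up; the directive names are Redis Sentinel's), the config management would render something like:

```python
# Render a sentinel.conf that includes the known-replica / known-sentinel
# state, so a freshly (re)configured Sentinel starts out knowing the
# topology instead of an empty view of it.
def render_sentinel_conf(master_name, master, replicas, sentinels, quorum=2):
    lines = [f"sentinel monitor {master_name} {master[0]} {master[1]} {quorum}"]
    for host, port in replicas:
        lines.append(f"sentinel known-replica {master_name} {host} {port}")
    for host, port, run_id in sentinels:
        lines.append(f"sentinel known-sentinel {master_name} {host} {port} {run_id}")
    return "\n".join(lines)

print(render_sentinel_conf(
    "mymaster", ("10.0.0.1", 6379),
    replicas=[("10.0.0.2", 6379)],
    sentinels=[("10.0.0.3", 26379, "3f0a..."), ("10.0.0.4", 26379, "9c1b...")],
))
```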
B
We don't have a working Redis setup. So let's... I've checked out this new branch, which writes those configs to the sentinel.conf file.
B
And so, let's see what happens. This is also logging the Redis log lines, so it's a little more verbose than the example that we had previously.
B
So we have a primary and we have a replica. And let's see what happened. So again, Sentinel decides, okay, we should perform a failover, but this time it actually successfully performs the failover.
B
So there's a promotion, and let's see if the... yes, and so now we get into a really strange state where we've promoted redis-2.
B
But redis-1... or, I think we've promoted redis-1, but then one of the Sentinels tries to reconfigure redis-1 to be a replica of itself. And so we get into this weird state where it's like, "I'm trying to connect to myself, but for some reason I can't," and it keeps doing this for like 10, 20 seconds, at which point the Sentinels realize that there's no primary and initiate another failover. So they perform that failover, at which point we have a primary, but redis-1 is still trying to connect to itself.
B
So we can see... yes, it's still trying to connect to itself, so this loop is still ongoing. And then eventually, it takes about a minute, this Sentinel realizes: oh, we have this node that is trying to connect to itself; that shouldn't be happening, it should be a replica of this other node. And so it issues this fix-replica-config command, and now we're finally in a stable state.
B
So it's super weird and it's not ideal, but it does eventually converge. So I would argue that this is probably better than where we were previously, but it's still kind of scary that Redis has these weird edge cases.
D
Yeah, it reminds me of every time I see people using, like, a Chef template to manage the sentinel.conf, and you know that it's going to end in... I think the gitlab main Chef one does that, doesn't it? It's not... it's...
D
Yeah, which is terrifying.
B
We probably want to move this to Kubernetes. Hopefully the Bitnami Redis Helm chart actually does what I'm proposing to add to Omnibus, so there is some prior art there as well, really.