From YouTube: Scalability Team Demo - 2021-07-15
A
So we had an incident yesterday, or maybe at the beginning of the week, where a bunch of jobs got dropped on a Sidekiq queue, and that sounds very familiar to incidents we had last year. Where's the incident issue? I want that open as well.
A
So during the incident we saw this comment, the one I was referring to, which made me bring up the issue. What happened was that a user did some transfers of projects, or something like that, something that requires recalculating authorizations, and that scheduled a whole bunch of jobs at the same time that Sidekiq would need to get through.
A
We had this problem in the past, and the workaround for it was to make sure these jobs are idempotent and to deduplicate the jobs when they're being scheduled at the same time. So if a job is already in the queue to recalculate the authorizations of a user, and we try to enqueue another job for the same user, we'll just skip the next one and rely on the single job that is already in the queue, because it would do the same work.
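As an illustration of the workaround being described, here is a minimal sketch of a Sidekiq worker that declares itself idempotent and deduplicates enqueues for the same user. It assumes the worker DSL used in the GitLab Rails monolith (`idempotent!`, `deduplicate`); the worker name and the authorization-refresh call are hypothetical, not the actual worker from the incident.

```ruby
# Hypothetical worker name; the idempotency/deduplication declarations follow
# the ApplicationWorker DSL used in the GitLab Rails monolith.
class AuthorizedProjectsRefreshWorker
  include ApplicationWorker

  # Running the job twice for the same user produces the same result.
  idempotent!

  # Drop a new enqueue if an identical job (same worker, same arguments)
  # is already waiting in the queue.
  deduplicate :until_executing

  def perform(user_id)
    user = User.find_by_id(user_id)
    return unless user

    # Recalculate the user's project authorizations (hypothetical call).
    user.refresh_authorized_projects
  end
end
```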
A
So, as you can see on this graph on my screen, we dropped all of these jobs, the ones that had the feature to read from the replica enabled, onto the queue, and we saw that a whole bunch of them were duplicates. That's the purple line, and the other colored line is all the jobs that were the first one scheduled for a single user.
A
So if we had still been able to deduplicate jobs, that line would have been the amount of jobs we had to process, but instead we were processing the sum of both lines. That's why I suggested to manoy, who added this worker in the first place, a way we can have our cake and eat it too, by using a different way of deduplication.
A
By not using the replica feature that checks whether a replica is up to date, because we know we already schedule these jobs a little bit in the future, we can actually still deduplicate them in this case. So I suggested a different way to make the queries go to the replica without having to check the replication delay of the replica.
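The transcript doesn't name the exact mechanism being suggested. As an assumption, this is roughly what such a change could look like in the GitLab worker DSL, where a worker can declare delayed data consistency so its reads go to a replica by simply being scheduled slightly in the future, rather than by checking replication lag per job, and deduplication can be extended to cover jobs sitting in the scheduled set.

```ruby
# Sketch only, assuming the GitLab ApplicationWorker DSL.
class AuthorizedProjectsRefreshWorker
  include ApplicationWorker

  idempotent!

  # Read from a replica by delaying the job a little, instead of checking
  # the replication delay of the replica at enqueue time.
  data_consistency :delayed

  # Also drop duplicates that are waiting in the scheduled set,
  # not only those already sitting in the queue.
  deduplicate :until_executing, including_scheduled: true

  def perform(user_id)
    # ... recalculate the user's authorizations as before ...
  end
end
```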
A
Yeah, regarding that, I just pinged monoi on the issue, since I was involved in the merge request and so on, and I added it to the rapid action, but I'm not quite sure that's the correct process for that.
B
Yeah, I think, since it's already part of the work that's been identified, it contributes to helping resolve that, and because it's framed in terms of the six things that were already recommended, that seems appropriate. It looks like it will make a significant difference on the work. Yeah.
A
Especially since we know this worker already, we know the work to be done. We've been through this, and part of that's kind of on me, because I did review that merge request and I didn't connect the dots that this is the same work, even if it's scheduled differently. So yeah.
B
Oh, it's been challenging because there's been quite a lot of change on those workers already; they've been making such good progress improving the state of the workers. I think we just can't expect to catch absolutely everything, but at least we know this is another thing that can be done to try and resolve the problem that's happening there.
A
There is also an issue open for allowing deduplication for all kinds of jobs that have this delayed data consistency set on them, but this is a quicker win for a job that we know schedules a lot of duplicates in some cases. So I think it makes sense to prioritize that one as a special case.
A
Right. Recently we started calculating error budgets for feature categories and stage groups, and this has been great because people are starting to improve their features. What I think is even cooler is that people have started to contribute to the SLIs themselves, in the runbooks repository, in the service catalog.
A
So currently this only works for groups that have a feature category that is mapped to a service. For example, the Pages SLIs already have a feature category marked on them, and that contributes to the error budget of the Release group.
A
And so they are interested in improving the visibility of that service, and they are defining SLIs and improving the SLIs we already have by adding metrics and changing the service definition together with us. This is not something that's currently possible for everybody that is contributing to the Rails monolith.
A
That's because it's a single service and several feature categories come out of it. What we're talking about in this project, five-to-five, is making a generic way for people to add metrics that would feed into their error budget, and allowing them to customize it on the fly. The first one we would be working on is the request duration, which is currently capped at one second, so a request that's faster than one second counts as good for the Apdex.
A
The other ones get dinged for that, but that isn't correct for all requests. For example, getting a JWT token should be way faster than one second, and on the other hand, getting the trace of a job, or a long poll, or whatever, can be slower, because the users aren't waiting for it.
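To make the threshold discussion concrete, here is a small illustrative sketch (not the production SLI code) of how an Apdex-style "share of good requests" changes when the single one-second threshold is replaced by per-endpoint thresholds. The endpoint names, durations, and threshold values are made up for the example.

```ruby
# Illustrative only: compute the share of "good" requests (an Apdex-style ratio)
# first with a single global threshold, then with per-endpoint thresholds.
DEFAULT_THRESHOLD = 1.0 # seconds

PER_ENDPOINT_THRESHOLDS = {
  'POST /jwt/auth'      => 0.25, # hypothetical: token issuing should be much faster
  'GET /jobs/:id/trace' => 10.0  # hypothetical: nobody is waiting synchronously
}.freeze

def apdex(requests, thresholds: {})
  good = requests.count do |endpoint, duration|
    duration <= thresholds.fetch(endpoint, DEFAULT_THRESHOLD)
  end
  good.fdiv(requests.size)
end

requests = [
  ['POST /jwt/auth', 0.4],      # under 1s, but over its own hypothetical 250ms target
  ['GET /jobs/:id/trace', 3.2], # over 1s, but fine for a trace download
  ['GET /projects/:id', 0.3]
]

puts apdex(requests)                                      # global 1s threshold => 2/3
puts apdex(requests, thresholds: PER_ENDPOINT_THRESHOLDS) # same 2/3, but the right requests are flagged
```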
A
So in this project we would allow the stage groups to define these thresholds themselves on endpoints, but while we're doing that, we would also set up a framework for them to add new SLIs to the services we have. That way we, from the infrastructure side of things, can monitor things that users care about, because product knows what users care about. That's the gist of it, well, just kind of a ramble.
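The meeting doesn't say what the per-endpoint declaration will look like. As one hedged assumption: GitLab controllers already annotate endpoints with a feature category and a request urgency, so a per-endpoint duration target could be declared in the same style, next to the existing metadata. The class, category, and action names below are illustrative and assume the GitLab Rails codebase context rather than standing alone.

```ruby
# Illustrative sketch, not the agreed-on design: declaring a slower duration
# target for one action in the style GitLab controllers already use for
# endpoint metadata such as feature_category.
class Projects::JobsController < Projects::ApplicationController
  feature_category :continuous_integration

  # Assumed/hypothetical for this sketch: a lower urgency (i.e. a longer
  # allowed duration) for trace downloads, because users are not waiting
  # on them synchronously.
  urgency :low, [:trace]
end
```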
B
Yeah, and I already see that there are comments on there from people on the product development side, so I'm also quite interested to see how we can get this one going. It's not just about, you know, agreeing that some endpoints can have a slower response time because the customers don't mind so much about it.
B
It drives home more about how they're using the system, and we're able to engage in a conversation and say: well, you know, we can increase it, but this is how we're going to have to do it, and this is what we can cope with, and this is what we shouldn't be coping with. Whereas at the moment there's not really much of that conversation, because it's a case of: well, it's one second or it's nothing.
A
Like right now we have that request duration SLI that doesn't get factored into the stage groups, because the cardinality would explode and we can't, but that one has two thresholds, one second and ten seconds. I think we're going to add validation inside the Rails application, just like we have for feature categories on endpoints. If a feature category doesn't exist anymore, the test will fail; here, if you define a threshold that is longer than what we say it can be, then the test will fail.
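The validation itself isn't shown in the meeting. As a hedged illustration of the kind of test being described, a spec could load every declared endpoint threshold and fail when a feature category is unknown or a threshold exceeds an agreed ceiling. The `EndpointThresholds` registry, the `FeatureCategories.valid?` helper, and the ten-second ceiling are all assumptions for this sketch.

```ruby
# Illustrative spec, assuming a hypothetical registry of per-endpoint
# duration thresholds; only the shape of the validation is shown.
require 'spec_helper'

RSpec.describe 'endpoint duration thresholds' do
  MAX_ALLOWED_THRESHOLD = 10.0 # seconds; assumed ceiling from the discussion

  it 'only references feature categories that still exist' do
    EndpointThresholds.all.each do |threshold| # hypothetical registry
      expect(FeatureCategories.valid?(threshold.feature_category)) # hypothetical helper
        .to be(true), "unknown feature category for #{threshold.endpoint}"
    end
  end

  it 'never exceeds the agreed maximum duration' do
    EndpointThresholds.all.each do |threshold|
      expect(threshold.seconds).to be <= MAX_ALLOWED_THRESHOLD
    end
  end
end
```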
B
And I was also thinking that, in order to get an SLI change approved into the system, it's got to come through a conversation with us. So if people want to increase that, you know, let's talk about it, let's engage with them.
B
Let's see what a reasonable value is to set for exactly that. Having everyone set it to 10 is just not what we want them to do. Yeah.
A
I think so. I think changing that threshold should go through us. We're also going to have to figure out a way to get this going. Perhaps we'll do what we did for the feature categories: a first pass over the most popular things, setting thresholds that can later be adjusted, while all the rest just stays at the one second we currently have, and if they want to change that, then it will have to go through review.
C
Can we maybe consider, and this is just a suggestion without me knowing any technical details about how this is going to be implemented, but I kind of feel this needs to be attacked from two sides, on two levels. It's fine to do it in the application like you're proposing, Bob; I think that's probably the right way to start, but I'm still concerned about relying on "the test will fail".
C
People will do random stuff with this at some point, right? I'm not saying someone will do it right now; I'm saying we're going to grow, things are going to change, so it would be good if we have two levels of protection. One is that we set those goals inside of the Rails app, but it would also be really good if we find a way for our infrastructure to automatically check whatever is being input there.
C
I don't know, when we roll something out, when we deploy something, it could emit a metric and then alert, or fail, or prevent the deploy from going forward. Just to get to a point where we have, you know, self-serve, which is great, but also a gate where we say that our infrastructure can't take more than whatever workaround someone might have placed there.
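One hedged way to read this "second layer" idea: the application could export the thresholds it was configured with as a metric, so infrastructure-side alerting can notice when the configured totals outgrow what the platform can absorb. The sketch below uses the Ruby prometheus-client gem; the metric name and the `EndpointThresholds` registry are assumptions carried over from the earlier sketches, not anything agreed in the meeting.

```ruby
# Illustrative only: export configured per-endpoint thresholds as a gauge.
require 'prometheus/client'

gauge = Prometheus::Client::Gauge.new(
  :endpoint_duration_threshold_seconds, # hypothetical metric name
  docstring: 'Configured per-endpoint request duration threshold',
  labels: [:endpoint, :feature_category]
)
Prometheus::Client.registry.register(gauge)

EndpointThresholds.all.each do |threshold| # hypothetical registry
  gauge.set(threshold.seconds,
            labels: { endpoint: threshold.endpoint,
                      feature_category: threshold.feature_category })
end

# Infrastructure-side alerting could then watch, for example, the sum of this
# gauge and page (or block a deploy) when the configured budget grows faster
# than the fleet supporting it.
```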
A
Are you talking about this single SLI that we're discussing, the request duration? Or are you talking about SLIs in general?
A
That's a very good point. I think at first we keep the gate we currently have, in the form of the SLI that exists today. If we see that too many requests get slower on a deploy, QA will hopefully fail.
A
We'll see stuff going wrong, we'll stop the deploy, and I don't think that's going to change. I think we're still going to have an overall measure; we will still grade requests in a general sense, but we'll have better visibility into the features themselves on top of that.
C
I almost feel like there needs to be a budget for this budget. You know, if you say everyone has this much and you can go up to a maximum of 10 seconds, let's use that as an example: we'll add more things to the application, so this thing is going to grow, but our infrastructure supporting all of this is probably not going to grow at the same rate. So we need to say, you know, if you have a bucket of 60 seconds total to spend, only six groups can set 10 seconds and everyone else gets zero, right?
A
And to take it further, it's not just the groups: some groups have 50 endpoints, so they need to spread their bucket over those 50 endpoints. But what if one endpoint is super popular and others get hit twice a month? This is all stuff to factor in, which is already caught by the general request SLI.
C
No, it's not possible right now, but let's say, in this supposedly impossible scenario, a bug gets introduced and the limits get ignored, or something like that. I mean, I'm hoping we're going to fail in early environments, but at the same time I think it's also possible that it won't show up in staging, right? So it would be good to have another layer of protection here, or at least to think about it.
A
I'm going to take a note of that on the epic as well. I don't think it's going to matter for this project, because this project is not about replacing what we currently have, which already guards us against that, but I think in the future we might want to unify both, and then we definitely need to think about that. Cool.
C
And naively, I'm thinking of the world where the application is actually going to emit metrics and inform the underlying infrastructure: this is how you need to scale, with Kubernetes. A lot of these things are possible; it's just that a lot of these things are also hard to do. You know, it will be outstanding if we get to that point at some point in time.