From YouTube: Scalability Demo Call - 2021-06-10
Description
No description was provided for this meeting.
A
Bob, I was wondering if you wanted to show what we've just been talking about: the panels that have just been merged into the dashboard, and then the piece of work with the operation count that hasn't yet been merged.

B
Yeah, I was just reviewing that merge request.

A
Hey, Bob's just getting ready to present.

A
Well, the reason that I'm excited about what Bob is going to show is that there was a conversation yesterday where the EMs were talking about being able to see into what was causing problems, and this gives them the ability to see it. It would just be so nice to close the circle on what they're asking for by saying: well, here it is, you can see it here.
B
So, am I sharing my screen? Yeah, I think so. What we've added is this row, which has a bunch of text explaining what is being shown. This is the most important part: it shows the breakdown of the error budget and where it's being spent. This is supposed to resemble the number of failures.

B
So it's the total failures that have been tracked over the 28 days that are used for the budget. Fixing the highest one is what would help. And then, because I couldn't figure out how to add links here, we've added the links here. So if you see that the Apdex for Puma is the number one source of the budget spent, then you can click open this link, which will result in a very slow table, and this version isn't deployed yet.

B
It's including Sean's improvements, because I like it better. I'm just going to ask Sean to sort it by this, and it's going to show you the endpoints, how many total hits you've had, and how many of those were slower than the threshold. That should allow the people, the engineers and so on, to dig into where the budget is being spent.

B
They start with the highest number here and then look at what endpoints, so Sidekiq jobs or requests, are contributing to that high number. For now we've only built these fancy links into Kibana for Puma and Sidekiq, because most people are spending most of their budget there. Yeah, that's it.
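A minimal sketch of the 28-day error budget arithmetic being described, in Python purely for illustration; the SLO target and the example counts are assumptions, not figures from the call.

```python
# Sketch of the 28-day error budget breakdown shown on the dashboard.
# The 99.95% target and the example numbers are illustrative assumptions.
SLO_TARGET = 0.9995          # assumed availability / Apdex target
WINDOW_DAYS = 28             # budget window used on the dashboard

def budget_fraction_spent(total_operations: int, bad_operations: int) -> float:
    """Fraction of the error budget consumed by one source (e.g. Puma Apdex)."""
    allowed_bad = (1 - SLO_TARGET) * total_operations   # the budget, in operations
    return bad_operations / allowed_bad

# e.g. 200M requests in 28 days, 40k of them failed or slower than the threshold:
print(budget_fraction_spent(200_000_000, 40_000))  # 0.4 -> 40% of the budget spent
```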
C
When I see what it's for, I think the whole discussion is moot, because this is just for ranking and comparing things, and nobody cares what this number actually means.

B
For the query that results in this, here's the same kind of sum over time, and we have it in the longer range ratios as well.

C
I saw that you gave more concrete examples, which is great, but I only saw it like ten minutes before the call, and I started thinking about it, thinking: oh yeah, this is difficult. But now one of the three concrete examples you just showed on the screen, and I thought, okay, we can.

C
From my point of view, we can stop talking about that particular one, because if it's just a number to direct people, to point people in the right direction, then it doesn't really matter.

C
...harder about what that means, but I don't want to hijack this conversation. I just saw that one thing and I wanted to check that that was the thing we were also talking about on the other issue.

C
Yeah, because this is a ranking; this is really only about ranking and not about what the number means.
A
Also, while we're on that, I just wanted to say thank you to Andrew as well for helping to prep and to get to the point where we're able to do this, because it feels like there's a whole bunch of work that's happened underneath this. Being able to take the result, which is the error budgets, and show it to the PMs and the EMs is just excellent, so I just wanted to say thank you for that.

B
The thing is, it took me a while to understand where we were going with all this stuff. We started adding the feature category, that's the first thing we did, and that was pushing us in this obvious direction. That was by now a year and a half ago, and at that point I didn't yet understand what it was going to be useful for.
E
Yeah, I think what we said, when we were thinking about what the Scalability team was going to do, was: we've got a monolith, and whether we like it or not, the monolith is going to stay, so we need to figure out a way to have attribution. In other big companies, what they do is break it down into services, and then it's really easy to blame a team: this service is this team's.

E
No, no, it's really easy to do attribution; one of the things about microservices, or any sort of service architecture, is that attribution becomes easier. So the question is, how do we do this inside the monolith? And I think we're setting up some really cool things here. I don't know if you've seen that the database team are now going through a similar exercise with the tables, and actually, on that:

E
I think it would be really good for someone from the Scalability team to get involved, because they're still talking about team attribution, and I think they should be talking about feature category attribution of the tables, and, where possible, using the attribution framework that we've got on those entities, or something similar. I don't know how many tables we can do it on, but for, you know, Active...

A
...Record models, yeah, exactly. Let's feed that issue back into the doc, the one about the database table attribution.
C
It would be a shame if they reinvent some of this attribution stuff, yeah, if it's really trying to do the same thing.

B
Well, as I've seen now with something that Huang Ming was building to add the feature categories into Sentry, but we're on a lower version of Sentry that doesn't allow filtering for multiple categories at once, it might be handy to have the thing that we now have in metrics, the mapping that maps a feature category to a group, inside something in the application as well. Yeah, but...

E
Yeah, no more Sentry errors for you today, sorry, you've spent your Sentry budget. Cool, do you want to move on? I thought I...
E
Yeah, I was madly trying to find one of these series while we were going through the last thing, and I haven't been able to find one, but I'm still going to try to explain it. This might be a very poor demo, so, kind of the type that we are.

E
So what it was: Craig and I were talking this morning, and Craig had this issue. Actually, maybe I've found one here. Craig pointed out this issue where he was looking at the rate of certain Sidekiq jobs, and he knows that they get called very, very infrequently, so once an hour or once every six hours.

E
But if you did a rate on them, a rate on the job, it always came back as 0.0; not zero point something-something and then a little number, but flat zero. And that doesn't make any sense, because we know that these cron jobs run once every now and again. And so Craig...

E
One of the things that Craig's looking at doing is putting better monitoring on these low frequency but high importance jobs, like the stuck CI jobs worker, which hasn't been running for months, and because it's below the threshold that we allow for our monitoring, we basically ignore it. But it's critical, so we can't do that. So we were looking at it and we discovered a really interesting thing, and I thought it would be good to bring it up in the demo, but of course I can't find one of these jobs.

E
No, yeah, I tried to. Let's just see if we can... sidekiq jobs... buckets...

E
I see zeros, yeah. Here we go, so yeah, this is a perfect example. It's actually really interesting that it changes through the day as well; that's part of the mystery. So here we have... and we'll just pretend we were back this morning, like when I was talking to Craig, because that gives us a better result.

E
Wow, okay. So this job, this CI drop pipeline worker, it ran on... let's try zooming in on this period a bit. Until... what was that, the ninth?

E
Right, so we know that this job ran at 19:46, okay? And if we go and look at a rate on that job, right...
E
So when this process, the Sidekiq process, starts, it doesn't initialize that Sidekiq metric to zero. That particular series, for the CI drop pipeline worker in this case, just didn't exist, right? And the moment that it comes into existence is the moment at which we set it to one. So the rate over there is just the derivative on that number, and it never went from zero to one.

E
It just appeared into existence at one. And so now that we have containers that are starting up much more frequently than before, very often they go from absent to one, and they never increase.

E
We see this happening all the time with these low frequency jobs, so basically we always lose the first one. For a lot of jobs that doesn't really matter too much, because they're high frequency, you know, ten times a second or whatever; if you lose one, it doesn't matter. But for the low frequency jobs that don't happen very often, it makes a big difference. And so I said, well, maybe what we can do...
E
My first proposal, which is a horrible proposal but easy to do, was: when the job starts, we initialize it to zero. But lots of these jobs will run for less than 15 seconds, so we'll set it to zero, we'll set it to one, and then the scrape will happen, and it'll just have the same effect. So we spoke about...
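A minimal sketch of the pre-initialization idea being discussed, using the Python prometheus_client purely for illustration; GitLab's actual exporter code is Ruby, and the metric and worker names below are assumptions.

```python
# Illustration only: creating the labelled child at process start makes the
# series exist at 0, so the first real increment is a visible 0 -> 1 step that
# rate()/increase() can see, instead of a series that first appears already at 1.
from prometheus_client import Counter, start_http_server

sidekiq_jobs_total = Counter(
    'sidekiq_jobs_completed_total',   # assumed metric name
    'Completed Sidekiq jobs',
    ['worker'],
)

KNOWN_WORKERS = ['CiDropPipelineWorker', 'StuckCiJobsWorker']   # assumed list

def preinitialize_series() -> None:
    # Touching each labelled child registers it with an initial value of 0.
    for worker in KNOWN_WORKERS:
        sidekiq_jobs_total.labels(worker=worker)

if __name__ == '__main__':
    preinitialize_series()
    start_http_server(8000)   # expose /metrics for the Prometheus scrape
    # ...later, each completed job does:
    # sidekiq_jobs_total.labels(worker='CiDropPipelineWorker').inc()
```

The trade-off raised later in the call still applies: pre-creating a series for every worker label multiplies the number of series, which is why the conversation turns to estimating how much extra cardinality this would add.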
H
If you add one inside the rate... I would guess that null plus one is still null, but null plus zero is one.

E
No, I think... I'm pretty sure that the way it happens in Prometheus is that, whatever you do, it doesn't see the step up. The one place where Prometheus is different is if you have a reset on a counter, right? Then it will say: okay, it was five, and then the next time we scraped it, it was three, and then it will kind of find the equidistant middle point on the right.

E
And that's often why, if you have something that very clearly is integers, you know, whole numbers, and you do a rate on it, you'll get, like, an increase...

B
Well, yeah, for all of the jobs I could possibly run, that might be a good idea, because I had a similar problem once. It was for low frequency endpoints, so HTTP requests, and then I think it was Sean and me who said: okay, we'll just initialize all these metrics. And then we had a cardinality explosion of routes that are never going to be used on GitLab.com. So then Ben was angry at us and we did the middle ground.

C
My take is to think of this as a short running process that can't get scraped, and then the way to get metrics would be a push gateway.
E
Unfortunately not, because Pushgateway doesn't work very well as an aggregator. Pushgateway doesn't have the concept of updating state; it's not like StatsD, where you say "increment this counter". You say "this metric is one", and then the next time you run, you don't know what the old metric was. It's not like Redis.

E
If you want to send it an update or increments or anything like that, you can't; you only give it the value, and Pushgateway guards against that sort of thing very much. It's the opposite of that. So I don't think that there's an easy way to do it with Pushgateway.
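A small sketch of the replace-rather-than-accumulate behaviour being described, using the Python prometheus_client purely for illustration; the gateway address, job name and metric name are assumptions.

```python
# Each push replaces the previously stored value for this grouping key; the
# gateway never adds to it, which is why it can't act as an aggregator here.
from prometheus_client import CollectorRegistry, Counter, push_to_gateway

registry = CollectorRegistry()
runs_total = Counter('cron_job_runs_total', 'Runs of this cron job',
                     registry=registry)

runs_total.inc()   # a short-lived process only knows about its own single run

# The gateway now exposes cron_job_runs_total = 1. The next run pushes 1 again,
# overwriting the stored value, so a rate() over the scraped series still never
# sees an increase.
push_to_gateway('pushgateway.example.com:9091',
                job='stuck_ci_jobs_worker',   # assumed grouping key
                registry=registry)
```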
E
You know, because we've seen what can happen when these jobs are failing all the time, and it's a bad thing. So if we can get it down to only the jobs that run on a certain fleet, it might be okay, because eventually it'll get up to that number.

C
Another way of looking at it is that we are updating counters spread across many different series where they should be one series. So if we had a single counter that we were updating in, say, Redis, then you don't...

E
So I think the easiest solution will be to figure out how many more series it's going to be, and then, if it's a lot, it's time to break prometheus-app down into prometheus-app and prometheus-sidekiq.

E
That's the boring solution, and if it's not that many, we can just keep them all in prometheus-app. I suspect that it's much fewer series than we have in prometheus-db, which has got all the pg_stat_statements combinations, which is a lot of data.
C
If we're only updating these latencies, then I guess that would be the number of buckets times the number of workers.
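A back-of-the-envelope version of that estimate in Python; every number here is an illustrative assumption, not a figure from the call. A pre-initialized histogram contributes one series per bucket plus +Inf, _sum and _count, per worker label, per scraped process.

```python
# Rough series-count estimate for pre-initialising per-worker latency
# histograms. All of these numbers are illustrative assumptions.
workers = 500      # assumed number of Sidekiq worker classes
buckets = 12       # assumed explicit histogram buckets
pods = 60          # assumed number of Sidekiq processes being scraped

series_per_worker = (buckets + 1) + 2        # buckets, +Inf, _sum, _count
total_new_series = workers * series_per_worker * pods

print(total_new_series)   # 500 * 15 * 60 = 450,000 extra series in the worst case
```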
C
But then you have an increase because of the number of pods, but pods go out of existence, and then Prometheus should garbage collect them, or how does that work towards...

E
...cardinality? Yeah, well, it's fine, kind of, in the now, but where it gets really heavy, where it gets really painful, is when you do a rate over, like, six hours or something like that: it's still got to go to all of these different buckets. I think the simple solution is just to pre-initialize.

E
If we can... yeah, that keeps things simple, but we've just got to figure out the count. That's...
C
...properly, yeah. We also have some very low frequency Gitaly RPCs, but there we've used...
E
...don't have that particular problem, yeah. So it's kind of an interesting edge case, but in order for the stuff that Craig's looking at to really be done properly, we will probably have to consider doing that. I also think, you know, we've had three Prometheuses for a long time, so if we have to break that into four, it's not the end of the world.

E
It's maybe six seconds, because for a lot of them the work will probably be quite divisible, so you're not creating a lot more work by running it more frequently, if you know what I mean. But that will require engagement with teams, and it's kind of a one-by-one sort of thing, so we don't really want to do that. So, what we spoke about, it might even be in this query, let's take a look...
E
That's it exactly. But going back to your original question, we had quite a long discussion around this and we both got very excited about it, and it's a very nerdy discussion. So this is what we've got at the moment, and this is fundamentally wrong. Well, it's not fundamental; it's like the last level of maturity, and now we're getting to the next level of maturity.

E
So we have this as our one hour and our five minute rates, and then we have this as our six hour and our 30 minute rates. Okay. The first thing that we really need to do is break that into two alerts, because then you can start doing really nice things. Like, if you had a really bad morning and we had a whole bunch of stuff, and we know that we've spent our six-hour budget, you can silence the six-hour alert and still get the alert for the one hour.

E
We're just going to say that the minimum alerting threshold is 10 samples, and the way that we can do that is we take this clause over here and we move it into these two things, like this. Live coding, my favourite thing... not. And then we do that, and then this is obviously a second alert.

E
But here we say: that times 3,600. That'll give us effectively the number of samples, because the rate is per second and there are 3,600 seconds in an hour, and then we say that that needs to be greater than 10, or whatever.
E
...or whatever we choose as the magic minimum number of samples that we need in order to evaluate the service. And then we say pretty much the same thing over here, except here we've got it on the six hour rate, and we say that times six times 3,600 has to be more than ten samples. And then the last thing that we do is we set up the three day tier, you know, the third tier, which we don't have, and it's also something that we should really fix, and that will also say that over a three day period you have to have ten samples as well. And so then we get away from having a minimum rate.
E
You know, over six hours at a minimum rate of 0.1, that's thousands of... I don't know. Well, it's a thousand-something samples, which is actually very, very high. So we can break that down and still monitor the low frequency jobs.
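A minimal sketch of that minimum-sample gate in Python; the threshold of 10 and the window lengths come from the discussion, while the function name and the example rates are assumptions.

```python
# Convert a per-second rate into an approximate sample (operation) count for a
# given alerting window, and only evaluate the SLO when there is enough data.
MIN_SAMPLES = 10

def enough_samples(rate_per_second: float, window_seconds: int) -> bool:
    return rate_per_second * window_seconds >= MIN_SAMPLES

print(enough_samples(0.003, 3600))      # 1h window: ~10.8 samples -> True
print(enough_samples(0.0005, 3600))     # 1h window: ~1.8 samples  -> False
print(enough_samples(0.1, 3600))        # the old 0.1/s minimum rate: 360 samples over 1h -> True
```

This is the same arithmetic as the "times 3,600, greater than 10" clause being live-coded in the alert expression, just written out as plain Python.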
E
Yeah, and it's the same. So when I say "sample" I'm thinking of it from a sort of statistical point of view, but yes, each operation is one sample that we're including, and the reason we had that ops filter was just to filter out low sample rates, you know, where you get three things and it's not enough data to really build something up here. So let's just take this and see if it works.

E
Yeah, but basically then, over a longer period, we can start doing monitoring of the low frequency jobs. The other thing that's worth pointing out is that we'll never have the three day, you know, the long period monitoring, go to the SRE on call.

E
We'll have that go straight to an issue tracker, and we can use the feature category routing that we've got for that. So then, instead of us getting an alert about stuck CI jobs not working, it goes straight to... this is kind of the future future, this is a few steps ahead, right, but there's no reason why the SRE on call needs to deal with that. It should just go straight to the team that's responsible for that job.
E
Okay, cool, yeah. It's pretty easy to do, and I've been meaning to do it for a long time as well, but now this seems like a good time. Oh, there's one not even complicated, but slightly complicated, thing that we have to take into account, and that is that often, when things go really pear shaped, the long and the short windows both instantly drop. And so one of the things that we don't have in our Alertmanager config at the moment is we don't use...

E
I even forgot the name of it. Alertmanager's got a thing where you can silence one alert based on the existence of another alert, and so what we don't want to do is send PagerDuty pages to the SRE on call saying: your six hour and your one hour are both violating now, here's two pages.

E
We only want one. What are they called... suppression rules. We can set up a suppression rule in Alertmanager to say: if the one hour is firing, forget about the six hour, we don't care about that, we just tell the person. The only thing that I have a slight concern about is that if you get those suppression rules wrong, you could do really horrible things by accident, and so that's why I've always been a little bit cautious about when I'm going to roll those out. But yeah.
H
Personally, I would just accept that we're going to get double paged as an initial step. Speaking as an on-call, my pager often blows up with multiple alerts at the same time, so it's not pleasant by any means, but it's tolerable.

A
Was there anything else that anyone would like to demo or to show?