From YouTube: 2023-02-16 Scalability team demo
Description
No description was provided for this meeting.
A
So, as an SRE, I get paged with apdex alerts. We use symptom-based alerting, so the alert is going to say "something is slow", right? And part of our job in Scalability, I guess, is diagnosing that type of thing, or helping diagnose it. So it's really useful to have latency attribution, so that we can not only see which requests were slow, but also where they actually spent their time. We've invested heavily in this type of, I guess you could call them, almost log-based metrics for the Rails app, where for any given request you can see: okay, the time spent in the database was this much, compare that to the overall request time, get a sense of what the expensive part of the request was, and then aggregate that information.
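As a rough illustration of the kind of per-request attribution this gives you (the field names and values below are invented for the example, not the actual production log schema), a minimal sketch in Go:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Hypothetical shape of a per-request structured log entry; the real
// Rails log fields may be named differently.
type requestLog struct {
	DurationS   float64 `json:"duration_s"`    // total wall-clock time of the request
	DBDurationS float64 `json:"db_duration_s"` // time attributed to database calls
}

func main() {
	line := []byte(`{"duration_s": 2.4, "db_duration_s": 1.9}`)

	var r requestLog
	if err := json.Unmarshal(line, &r); err != nil {
		panic(err)
	}

	// Attribution: how much of this request was spent in the database?
	fmt.Printf("db share of request time: %.0f%%\n", r.DBDurationS/r.DurationS*100)
}
```

Aggregating that share across many log lines is what makes these "almost metrics".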
A
On many of our other services, which are not Rails, we don't have that much of that information; in some cases we don't have any of it. And Gitaly in particular has many different sources of latency. So let me share my screen and maybe talk through a Gitaly request to get an idea. An RPC comes into Gitaly, and there's a concurrency limiter at the beginning of the request cycle. This is per repo, so if there are a gazillion clones happening against a single repo, that concurrency limiter will kick in. It's configurable per RPC.
A
We've only enabled it on some of the long-running RPCs, but this concurrency limiter will put everyone else in a queue, and they're going to wait. So if we're getting alerts for slow requests, it's really useful to know: okay, did they just spend their time waiting for their turn in the queue? The other thing that Gitaly does during most of its RPCs is shell out to git commands, right? That's essentially an external call, and we want to know how much time was spent during that external call. So a while back we got these command stats.
A
Let's see if we have an example of those here... maybe not. But these basically say how much system and user time, that is, how much CPU time, was spent cumulatively in this request, across all the commands that we shelled out to. So that's pretty useful, but there were a few blind spots, and so this was a recent addition.
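A minimal sketch of the idea behind those command stats, accumulating user and system CPU time across every command a request shells out to; this is illustrative only and not Gitaly's actual implementation or field names:

```go
package main

import (
	"fmt"
	"os/exec"
	"time"
)

// runAndMeasure shells out to a command and returns how much CPU time it used.
func runAndMeasure(name string, args ...string) (user, system time.Duration, err error) {
	cmd := exec.Command(name, args...)
	if err = cmd.Run(); err != nil {
		return 0, 0, err
	}
	return cmd.ProcessState.UserTime(), cmd.ProcessState.SystemTime(), nil
}

func main() {
	var totalUser, totalSystem time.Duration

	// Pretend these are the git commands a single RPC shells out to.
	for _, args := range [][]string{
		{"git", "version"},
		{"git", "version"},
	} {
		u, s, err := runAndMeasure(args[0], args[1:]...)
		if err != nil {
			fmt.Println("command failed:", err)
			continue
		}
		totalUser += u
		totalSystem += s
	}

	// Cumulative CPU time for all shelled-out commands in this "request".
	fmt.Println("user:", totalUser, "system:", totalSystem)
}
```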
A
So the concurrency limit was one, and then we have a few other mechanisms. Spawn tokens are a way of protecting against excessive shelling out, and this is basically another queue: if there is contention, RPCs are going to have to wait, and being able to see that we were waiting is a useful addition.
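To make the "where did the time go" question concrete, a toy breakdown of a single slow RPC might look like this; the numbers and category names are invented for illustration, not taken from a real Gitaly log:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Hypothetical timings for one slow RPC.
	total := 30 * time.Second
	concurrencyQueueWait := 22 * time.Second // waiting for the per-repo concurrency limiter
	spawnTokenWait := 3 * time.Second        // waiting for a spawn token before shelling out
	commandTime := 4 * time.Second           // CPU/wall time in git subprocesses

	unattributed := total - concurrencyQueueWait - spawnTokenWait - commandTime

	// With these numbers the alert is about queueing, not about git being slow.
	fmt.Println("queue wait:", concurrencyQueueWait)
	fmt.Println("spawn token wait:", spawnTokenWait)
	fmt.Println("git commands:", commandTime)
	fmt.Println("unattributed remainder:", unattributed)
}
```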
A
So let me maybe open a few of these and we can see what the logs look like. This one didn't have an example, but I had an example in one of these for sure... this one.

A
So cat-file, I'm not going to go into what it is, but it's a mechanism in Gitaly, and you can do different operations against this cat-file cache, and so we can see:
A
Okay, the overall duration spent in cat-file was one millisecond, and then we've got a breakdown for three different types of operations here: we've got "request object", we've got "read object", and we've got "flush". Based on these metrics alone you can already tell: okay, we're requesting a lot of objects and we're reading a lot of objects, but we're flushing not that often, so there's actually a batching mechanism involved.
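As a quick sketch of that interpretation (counter names and values are illustrative, not the real Gitaly log fields):

```go
package main

import "fmt"

func main() {
	// Illustrative cat-file cache counters for one request.
	requestObject := 120
	readObject := 120
	flush := 3

	// Many requests and reads per flush suggests batching is happening:
	// objects are requested and read in batches before each flush.
	fmt.Println("requests:", requestObject, "reads:", readObject, "flushes:", flush)
	fmt.Printf("objects read per flush: %.1f\n", float64(readObject)/float64(flush))
}
```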
A
Yeah, so I think these recent additions are going to be really useful for diagnosing alerts and other latency issues in Gitaly. I just wanted to highlight that and see if there are any questions or comments on it.
A
Yeah, the Gitaly logs use milliseconds, so I based it on that. But I agree, it would be nice to have consistency, and in fact it would be nice to also have consistency in field names across logs, which is pretty terrible right now; even just going from Rails to Workhorse, the path is in a field called URI, and things like that.
D
So, Bob, also on the milliseconds: I don't know why I think this, but I feel that for logs it's less important to have them all in the SI unit than it is with metrics. I don't really know why, but certainly all the gRPC logs that we've used forever have always used milliseconds, and it feels like the right thing; it's easy to grok.
A
Yeah, the thing I was going to say about this is that what's important to me is that it's consistent within a given log stream, so that you can compare the fields and chart them against each other. What's nice to be able to do is have a single chart with all of the duration fields summed.
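A minimal sketch of why a consistent unit and naming convention within one log stream matters: if every duration shares one unit and one suffix, you can sum them mechanically and chart them against the total. The field names here are assumptions for the example:

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

func main() {
	// One hypothetical structured log entry with consistently named duration fields.
	line := []byte(`{"duration_s": 2.4, "db_duration_s": 1.1, "gitaly_duration_s": 0.8, "redis_duration_s": 0.2}`)

	var entry map[string]any
	if err := json.Unmarshal(line, &entry); err != nil {
		panic(err)
	}

	var attributed float64
	for field, value := range entry {
		v, ok := value.(float64)
		if !ok {
			continue
		}
		// Sum every component duration (everything ending in "_duration_s").
		if strings.HasSuffix(field, "_duration_s") {
			attributed += v
		}
	}

	total, _ := entry["duration_s"].(float64)
	fmt.Printf("attributed: %.1fs of %.1fs total\n", attributed, total)
}
```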
D
But yeah, I think that point three is a good thing to talk about with Igor. Is there a way that you can communicate this information to, say, staff-level or support engineers? They might also find it super helpful for self-managed instances, where Gitaly is one of the big performance issues, particularly with weird file systems and such. So how would you... how would you?
A
Yeah, okay, so moving on to the second one. This is related, and it's going from Gitaly to the registry now, which has really sparse logs. To be fair, I don't feel like I need to go looking at registry logs very frequently; it's been a fairly stable service, and I guess I have not had the need that often. But we did have a recent incident around some database performance issues in the registry, and it was kind of hard to pinpoint, just because we don't have this information. The step of going from an alert to the external system that is most affected is tough, because what do you do?
A
Maybe you go metrics fishing and hope you find something. So I did open an issue to promote this for the registry as well; it's basically the same thing, doing for the registry what we have in other places. And there are sort of two questions that came out of this for me. One of them is the maturity model, which currently doesn't say much about what your logs should look like; it just says they need to be structured JSON logs.
A
That's it. So I'm curious what people think about giving a bit more guidance on what the logs could look like or what they should contain. You know, we talked about the Elastic Common Schema; this could be our own version that at least says this type of information needs to be present, and maybe gives some recommendations for units and field names for newly created logs, so that at least we have consistency going forward.
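For example, a minimal "required fields" baseline could be sketched as something like the following. This is purely illustrative, not an agreed GitLab schema; the field names and units would be exactly the things such a guideline needs to decide:

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// baseLogEntry sketches the kind of fields a log-format guideline could require
// from every service, with consistent names and one agreed unit for durations.
type baseLogEntry struct {
	Time          time.Time `json:"time"`           // RFC 3339 timestamp
	Severity      string    `json:"severity"`       // e.g. "info", "error"
	CorrelationID string    `json:"correlation_id"` // propagated across services
	Message       string    `json:"message"`
	DurationS     float64   `json:"duration_s"` // same unit everywhere
}

func main() {
	entry := baseLogEntry{
		Time:          time.Now().UTC(),
		Severity:      "info",
		CorrelationID: "abc123",
		Message:       "request finished",
		DurationS:     0.042,
	}
	out, _ := json.Marshal(entry)
	fmt.Println(string(out))
}
```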
D
I always feel that it should be, not that it is, but it should be owned by Scalability, because it's all application performance. Possibly. But I don't think they've got any committers on it at the moment, or maintainers, for example.
D
LabKit Ruby, yeah. But certainly one of those two teams should become more involved, because, like you say, it's in a pretty bad state: it's Igor, Steve, myself, and I don't know who else, and obviously Bob on the Ruby one as well, and Matthias.
A
Yeah, and I guess I'm also not sure whether we have capacity to do this now, or how it weighs against some of the other priorities that we've got. So maybe this is also more of a long-term thing to keep in mind as we finish up the stuff that we're working on at the moment.
C
I don't know what it looks like currently in LabKit Go, but I liked what we did.
C
The
lab
kit,
Ruby,
actually
Matthias,
started
this
all.
He
moved.
We
had
a
bunch
of
loggers
and
stuff
that
lived
in
gitlab
rails
only
and
then
we
used
lab
kit
to
call
out
fields
to
put
in
those
log
messages.
So
every
time
we
use
the
logger
where
the
gitlab
app
logger,
that
is
a
Json
logger
or
an
import
logo,
or
anything
like
that.
C
We
would
call
out
to
the
const
context,
so
lab
kit
context
and
say
get
me
all
the
fields
to
add
to
the
logs
and
now
Matthias
a
while
back
has
moved
all
of
the
loggers,
the
structured
loggers
into
lab
kit
itself.
So
it
means
we've
got
the
Json
logger
there
now
and
this
merger
Quest
changed
part
of
the
behavior.
C
Here, instead of just putting in the correlation ID and making sure that is always there, it's going to include the entire context all of the time. So that means anything that uses the JSON logger, and everything in GitLab Rails, is going to have the current context information, always. We could add more fields here and expand on that, but I like that all of these messages will now have at least these fields consistently, and we could also start propagating them across services this way.
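A minimal sketch of the pattern being described (fields carried in a context and merged into every structured log line); this is not LabKit's actual implementation or API, just the general shape:

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
)

type ctxKey struct{}

// withFields attaches log fields (e.g. correlation ID, project, user) to the context.
func withFields(ctx context.Context, fields map[string]any) context.Context {
	return context.WithValue(ctx, ctxKey{}, fields)
}

// logWithContext emits a JSON log line that always includes the context fields.
func logWithContext(ctx context.Context, message string) {
	entry := map[string]any{"message": message}
	if fields, ok := ctx.Value(ctxKey{}).(map[string]any); ok {
		for k, v := range fields {
			entry[k] = v
		}
	}
	line, _ := json.Marshal(entry)
	fmt.Println(string(line))
}

func main() {
	ctx := withFields(context.Background(), map[string]any{
		"correlation_id": "abc123",
		"project_path":   "group/project",
	})
	logWithContext(ctx, "import started") // includes the context fields automatically
}
```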
D
Yeah, just on the context propagation: that's something that's quite strong in tracing frameworks. They have that sort of weird bag of things that you can push along with your trace, and it's one of those things that's really good but also kind of terrifying, because people can put 8K messages in there and pass that around to Gitaly 50 times in a request. But in general it's a really useful thing.
D
But if you go and look, that's very much part of distributed tracing: that's context propagation.
C
Partly, though with mixed success. We've limited what goes into the context from GitLab Rails, at least, so that it's tied to a known thing. There's this thing called application context in GitLab Rails, and it wants things like IP addresses, or records that are a project or a user or whatever, and then it expands information from that. But it's not like you can put arbitrary stuff in: you need to add a field, and its type and everything, into the class before you can put stuff in. So that was an attempt to, yeah...
D
Then, for passing it between services, are we just putting it in a header or something like that?
C
Passing it between services, yes, I would do that with a header. From Rails to Workhorse we could use headers, like the Rack request headers. From Rails to Gitaly we could... I don't know what that's called... the metadata thing, yeah, okay.
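A minimal sketch of that header-based propagation between two HTTP services (the header name below is invented for the example; for gRPC calls such as Rails to Gitaly, the analogous mechanism would be request metadata rather than an HTTP header):

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// Illustrative header name, not necessarily what production uses.
const correlationHeader = "X-Request-Correlation-Id"

func main() {
	// Downstream service: read the propagated ID and use it in its own logs.
	downstream := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Println("downstream saw correlation id:", r.Header.Get(correlationHeader))
	}))
	defer downstream.Close()

	// Upstream service: attach the ID to the outgoing request.
	req, err := http.NewRequest(http.MethodGet, downstream.URL, nil)
	if err != nil {
		panic(err)
	}
	req.Header.Set(correlationHeader, "abc123")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	resp.Body.Close()
}
```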
C
Apparently I've got the next thing as well. While we were looking into Prometheus and Thanos... Thanos is now its own service, with its own separate dashboard. Actually, that's nicer to show, let's start with that one.
D
One of the things that's worth stating is that previously we had this dashboard split across all the environments, like we do with all of our other services, so we had production and staging. But really that didn't make any sense, because Thanos operates as a single service, right?
D
It's not multiple services, one in each environment; there's one service. And breaking it up that way actually hides a lot of the information from you. So we had to jump through a whole lot of hoops to get our monitoring system to be able to monitor across all environments as a single thing.
A
The sidecar... the sidecar and the Thanos store.
D
Yeah, I think that's something we should definitely consider. Bob and I were also talking about doing the same thing with the logs, although I'm pretty skeptical at the moment that those logs are being collected properly, because I've been logging onto the Thanos router box to get logs that I can't find in any of the Elasticsearch logs at the moment.
C
For the ruler, I am proposing to add a new SLI, this one, which is supposed to measure how long a rule group takes compared to its interval. If the execution of all the rules in the group takes longer than the interval, that's going to count as an error on the SLI, and I wanted to add this one for Prometheus and for Thanos.
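The proposed SLI boils down to a simple comparison per rule group; a minimal sketch of that condition (Prometheus and Thanos expose the last evaluation duration and the configured interval per rule group as metrics, though the exact metric names are not asserted here):

```go
package main

import (
	"fmt"
	"time"
)

// ruleGroupExceedsInterval captures the proposed SLI: a rule group evaluation
// counts as an error when evaluating all rules in the group takes longer than
// the group's configured interval.
func ruleGroupExceedsInterval(lastEvalDuration, interval time.Duration) bool {
	return lastEvalDuration > interval
}

func main() {
	fmt.Println(ruleGroupExceedsInterval(75*time.Second, 60*time.Second)) // true  -> SLI error
	fmt.Println(ruleGroupExceedsInterval(20*time.Second, 60*time.Second)) // false -> OK
}
```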
C
So that's the one I'm working on now, and I want it because, for the feature category metrics, we use intermediate recording rules, but those are not recorded in the same group. I want to get rid of that indirection, because I think some of the errors, like an apdex ratio being bigger than one, bigger than 100%, and so on, are coming from that interaction.
E
The problem is, and we've talked about this before, that we're using a success ratio for one and an error ratio for the other, like if the...
D
I'll need to look through it, but the reason I'm thinking about this is: we've got this rule evaluation, right, and we've already got an error rate for it, which is when the rule gave a warning or an error when it was evaluated, and that's quite a nice error rate. But we don't have any sort of apdex or any sort of latency measurement or latency SLI for rule evaluation, and this really is a latency thing: it's saying it's taking too long. But I know that it's really hard to measure.
D
I think you have to build a custom metric for the apdex, because it's not a histogram. So normally for an apdex we...
D
But if we can, I just think that for someone a year down the line who's on call and gets an alert, it will be clearer to them that this is a slowness problem rather than an error, because it's coming out as a slowness SLI. But yes, from an affordance point of view, if you want to say...
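Since there is no histogram to derive an apdex from, one way to express it is a custom counter pair: total evaluations, and evaluations that finished within the threshold, with the ratio computed at query time. A minimal sketch using the Prometheus Go client; the metric names are illustrative assumptions, not an agreed design:

```go
package main

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

var (
	evalTotal = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "rule_group_evaluations_total",
		Help: "Total rule group evaluations observed.",
	})
	evalWithinInterval = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "rule_group_evaluations_within_interval_total",
		Help: "Rule group evaluations that finished within the group interval.",
	})
)

// observeEvaluation records one evaluation; the apdex-style ratio is then
// within_interval / total, computed at query time.
func observeEvaluation(duration, interval time.Duration) {
	evalTotal.Inc()
	if duration <= interval {
		evalWithinInterval.Inc()
	}
}

func main() {
	prometheus.MustRegister(evalTotal, evalWithinInterval)

	// Example observations.
	observeEvaluation(20*time.Second, 60*time.Second) // within interval
	observeEvaluation(75*time.Second, 60*time.Second) // too slow
}
```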
D
Bob put this on here, but I'm happy to talk about it. So Chance is doing this ownership exercise for all the services, and I think it's really important, especially as the new Reliability teams are forming. I think it'll help them get a real, solid idea of which of those teams owns which services, and also, for our teams on the platform side, what we own. And then Rachel asked a really good question. We said we're going to get the managers to choose which things their teams own, and Rachel said: okay, but
D
what does ownership actually mean? Which is a very good, but I think also very hard, question to answer, and one that could get fairly philosophical. What I suggested was that we just keep it very pragmatic for now: effectively, what it means is that when something gets labeled in the tracker with service::X, we know which team is going to triage that issue.
D
It
basically
false
you
know,
and
we
can
write
a
triage
Ops
tool
for
basically
assigning
the
team
based
on
that
and
it's
their
responsibility
to
keep
the
the
triage
on
those
issues,
and
that
doesn't
mean
that
they're
going
to
fix
everything
but
like
at
the
least
they
can
say,
won't
fix
or
we're
going
to
put
this
on
the
schedule.
We're
going
to
reassign
it
to
a
different
team.
But
they
they
are
the
the
kind
of
entry
into
that.
D
And
you
know,
via
that,
all
the
capacity
planning
issues
or
if
there's
any
sort
of
availability
issues
will
arrive
at
them,
because
they
are
that
first
line
of
of
contact
for
that
service.
So
that's
kind
of
it
and
then
kind
of
sort
of
more
sort
of
broadly,
as
a
future
thing
which
is
kind
of
might
be
controversial,
is
you
know
the
scalability
team's
done
a
lot
of
great
work
to
kind
of
push
error
budgets
left
in
services
like
web
and
API
and
sidekick
that
are
running
application
services?
D
And
maybe
it's
also
time
to
start
thinking
about
for
some
for,
like
a
limited
number
of
services,
also
having
some
sort
of
error
budgets.
But
on
the
infrastructure
side
so
like
the
one
that
I'm
thinking
of
most
is
to
try
and
prevent
the
Thanos
problem
in
future
right.
So
if
Thanos
is
kind
of
degrading
and
kind
of
getting
worse
and
worse
and
worse,
we'll
get
to
a
point
where
we
say.
Okay
like
this
is
something
that
needs
eyeballs
on
it
and
we
start
doing
the
same
sort
of
error.
D
Budget
reporting
as
we
do,
for
application
teams
on
the
application
side.
But
for
certain
infrastructure
services
like
I,
don't
think
it
would
make
sense
to
have
double
reporting
for,
like
the
web
service,
for
the
application
teams
and
for
the
infrastructure
teams
like
I,
think
we've
got
that
service
covered,
but
there's
a
whole
bunch
of
services
that
don't
have
the
same
kind
of
coverage,
and
maybe
it
makes
sense
to
start
doing
the
same
sort
of
accounting
for
for
those
services.
And
it's
very
much
like
a
very
rough
early
sort
of
idea.
C
Right now, we in Scalability have feature categories that apply to us and that we can use inside the application, and that can trickle down into SLIs and so on. Why can't we do that for all teams in Reliability, for example for Thanos?
D
The team, yeah, exactly, and that's the team ownership, that's the connection. But "feature category" I think is a reserved word at GitLab, because it very specifically means one thing; but yes, some categorization that we roll up, so we can look at it at a higher level. One of the things with the error budgeting was that we could roll it up to the level of teams and then to the level of directors.
D
Whatever we call it, naming's hard, but you're saying that Thanos and Prometheus are, say, "metrics", and CP is "exception management", and then in future, if those two things go to different teams, we don't change anything except for that attribution, in the same way that feature categories work, yeah.
D
"Service category" might be a good name for that. Yeah, I mean, it works very well on the product side to have those feature categories, and then teams subscribe to them. I don't know if you've followed, but there's a file in the handbook called stages.yml, and then things get moved around and we don't have to worry about rearranging, because we just follow the feature categories in that file.
D
Yeah, so do we do the feature categories now, as part of that service category thing, or do we just do the ownership now and then do that as a second stage? Probably better not to nerd-snipe it and just get it done, and then we can break.