From YouTube: 2022-10-06 Scalability Team Demo
A: I'll do that. So, Tamland. This came up when suddenly nobody was available, and then Stephanie and Matt needed to review things that I'd created, and they'd never seen Tamland before. So I think this is a little bit of an introduction to what it does and how it does it. So, where shall we start? Any suggestions on where to start? So we'll start with the source metrics and the saturation points in the runbooks.
A: Whatever you think about when we're looking at things, throw it out, because, yeah, I don't know what to talk about either, and I've just been doing stuff. So, the runbooks. We have these things called saturation points in our runbooks; look at the Redis CPU, that was the one that I was recently looking at. The point of these metrics that we define, like the query that we defined here, is that it's supposed to spit out a percentage.
A: So, a number between zero and one: zero is excellent, one is burning. And then we set the SLOs on that, which is where we're going to alert. That's the short-term thing: we generate alerts from these metrics, so as soon as this goes above this SLO, the on-call gets an alert.
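A rough Python sketch of the convention just described; the names here are hypothetical, not Tamland's actual code:

```python
# A saturation point's query yields a ratio between 0 and 1:
# 0.0 is excellent, 1.0 is burning.
def should_alert(saturation_ratio: float, slo: float) -> bool:
    """Page the on-call as soon as the ratio exceeds the SLO."""
    return saturation_ratio > slo

# e.g. a Redis primary CPU at 0.95 against an SLO of 0.90 pages the on-call:
assert should_alert(0.95, slo=0.90)
```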
A: What we also do, so, from that we generate a whole bunch of...
A: Oh yeah, I'll... I won't. But, so, Tamland. Tamland is like an IPython notebook, so, a Jupyter notebook. Let's pick the Redis one, since we're following that around, yeah. And here...
B: This is the piece I'd love extra details on; I haven't worked with Jupyter notebooks at all.
A: All right, okay. So it's funny, because these are not... okay: normally you'd write the markdown file for a Jupyter notebook yourself, and the code cells are Python that gets executed.
A: We generate our markdown files, and this one is from service.md.jinja, so that's the template for this file. The front matter is, yeah, just like an introduction that's rendered on every page, not super important. It's this bit.
A: Then these imports, saturation forecasts and operation forecasts: those are the two sections that are going to be on the page. If you look at the page... let's open up the Redis one.
A: So first we've got the Redis non-horizontally-scalable resources. Here's the primary CPU one that we were looking at before, and that comes from...
A: ...and that's just a dump of what you just saw in the runbooks repository. That's synced every week, I think, but there's a bot creating merge requests for us every time saturation points and so on change. This manifest gets updated, and that's what we use to generate this markdown file, which here, in the components dict, has every saturation point that we pass in.
A: The operation rates are a little bit different, because there we just select all of the operation rates for a certain service, using the GitLab component operation rate, like, for every SLI. This is less clever: here we know which ones we should have, the horizontally scalable ones and the non-horizontally-scalable ones, so we know which ones we should have, and we render it all out in this dictionary.
A: So if you see here, it reaches out into this saturation forecast plot_series method, and it prints out this components dict. So, let's compare... okay.
A: So, let's... yeah, go ahead, Matt.
B: I wanted to let you finish, and then... oh, I had a couple of topics to kind of explore as tangents, so keep going. This is fantastic. Okay.
A: So first, saturation forecasts. That's where we do this saturation thing for all of the component saturation metrics. So those ratios that I just showed you in the runbooks, those get rendered out in these first two sections, and they're basically the same, except ones are marked horizontally scalable and the others aren't, and we treat the non-horizontally-scalable ones as more important, because they're more difficult to scale, so we want to look at them first.
A: So then let's go into the saturation forecasting itself, and the plot_series method is what we're looking at. So we get the page name; that's all stuff that's just used for the report. Here's the components dict: that's this huge dictionary that you saw on the other side that we pass in. And then for each component we do a plot_forecast here, and what that's doing is getting a bunch of the...
A: ...properties of that component, to know which query to build: so, like, removing the outer-join labels; removing the threshold, because that's not part of the labels that are on the series. And then we get the capacity planning strategy, to know... yeah, I can show you that. And load the historical data frame, to know which metric we're going to use, like, we have these quantiles.
A: When we specify other capacity planning strategies, then we pick another query, but this is the query that we're going to be performing in the end. This series passed in here, it takes the dictionary that we just saw, and whatever's left in it gets turned into the label selector.
A: Batched query range: we pass in the query that we just built here on the left, and then the start date and the end date. Tamland uses a 180-day history, so we load all the data from 180 days ago to yesterday at midnight; yeah, yesterday, end of day.
B: Out of curiosity, where is that 180 days specified?
A: That is in an environment variable. Let's start again at plot_series, so...
A: Yeah, okay. But it's an environment variable, and now that we have the cache we could extend it, but yeah. So where was I? Querying with ranges.
A: I'll show you, I was just... okay, let's keep walking through, like, that's where we're going to end up, I'm fine here. So what we do is we batch the thing: 180 days we can't load in one query from Thanos. Maybe after Matt and Igor are done with the Thanos compactor we would be able to, but I don't think so.
A: So we batch this into the 180 days, and we load 24 data points, step 3600; that's the resolution that we're using. So we load 24 data points per chunk of one day, and the chunk of one day is defined here.
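As a sketch of that batching, under the assumption of a 180-day history at a 3600-second step; the environment variable name and the fetch callable are made up for illustration:

```python
import os
from datetime import date, datetime, time, timedelta

HISTORY_DAYS = int(os.environ.get("HISTORY_DAYS", "180"))  # hypothetical env var name
STEP_SECONDS = 3600  # hourly resolution: 24 data points per one-day chunk

def day_chunks(history_days: int = HISTORY_DAYS):
    """Yield (from, to) pairs, one per day, from 180 days ago to yesterday end-of-day."""
    end = datetime.combine(date.today(), time.min)  # today 00:00 == yesterday, end of day
    start = end - timedelta(days=history_days)
    while start < end:
        yield start, start + timedelta(days=1)
        start += timedelta(days=1)

def load_history(query: str, fetch):
    """fetch(query, frm, to, step) stands in for the Thanos range query."""
    return [fetch(query, frm, to, STEP_SECONDS) for frm, to in day_chunks()]
```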
B: This is fascinating. The Thanos component called Thanos Query also breaks up large time spans into smaller ranges for caching purposes, so we're doing something similar here in Tamland. I just learned about this fairly recently. Thanks, yeah.
A: Me too, thanks to Igor walking me through them, like... because you can have overlapping blocks at that point, because you have the raw data and the downsampled data, which have the same... yes, yeah.
A: So yeah, there we go: we separate it into days, and then we iterate through all of the days, and this is the loop that iterates, and we stop if...
A: ...yeah, if we've reached the last date that we need to get. Now, as for the cache: every ranged query like this is cached, so we do that query_range_with_cache.
A: We don't do that anymore. So we have this prom cache manager thing that takes the query and the from and to datetimes; this is already sliced up, so here this from and to will be 24 hours apart, and this step is going to be 3600, so the 24 data points. And then we try to read from the cache manager.
A: This one, I think... yep, here we go. So, what does it do?
A: It loads the entire day and returns that data frame, and the data frame is a pandas DataFrame; prometheus-pandas is the library that we use to query Prometheus and so on. So it returns exactly the same thing as we would have if we queried: query_range would return the same kind of data frame.
B: The width of these individual... what are we calling them, chunked ranges?
A: The width of this... wait, the width of this data frame returned is one day, yes.
B: Okay, great. And I'm just realizing that earlier you mentioned that we sync the definitions for our saturation thresholds on, I think you said a daily basis; maybe you said weekly.
B: I was just kind of thinking about the... like, if we made... sorry, I said that I'd hold questions; let's keep going with this.
A: No, no, no. I see where you're going. I think there is likely something to go wrong with invalidating the cache if we change the same metric to be different suddenly.
A: Makes sense, yeah. So, awesome, okay: we read, and then we go through the cache, and the cache looks like this. So these, and this is the part of the thing that I might want to change, these are like a hash of the query, so exactly the query string that we pass into Thanos; it's SHA-256 or something like that. And then for each query we've got the day.
A: This is just stored as a directory.
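Roughly, the layout described is the following; the directory name and file format here are assumptions for illustration:

```python
import hashlib
from datetime import date
from pathlib import Path

CACHE_DIR = Path("cache")  # a plain, git-ignored directory

def cache_path(query: str, day: date) -> Path:
    # One directory per query, keyed by a hash of the exact query string
    # sent to Thanos ("SHA-256 or something like that"), one file per day.
    digest = hashlib.sha256(query.encode()).hexdigest()
    return CACHE_DIR / digest / f"{day.isoformat()}.parquet"
```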
A: And no, not in the git repo; it's not checked in, it's written and git-ignored. But we have this task, so this is how it loads the data. Do you want me to continue from here and go into how it gets the data and caches it, and makes sure that this thing doesn't take six hours to run? Or do you want me to continue into the forecasting bit?
B: Let's skip the forecasting for now and continue with the mechanics, okay? Okay with you, Stephanie?
A: Yeah, totally, okay. So here the query gets a cache hit, and if it isn't there, it's written. So here, the query is a cache miss: we perform the query with retries, so if there's one failing, then we try again. When we do get the data frame, we write it out, and that happens here, in the directory structure that I just mentioned to you.
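Sketched out, reusing the hypothetical cache_path from above; the retry and storage details are assumptions:

```python
import time
import pandas as pd

def query_range_with_cache(query, day, fetch, retries=3):
    path = cache_path(query, day)
    if path.exists():                 # cache hit: return the stored frame
        return pd.read_parquet(path)
    frame = None
    for attempt in range(retries):    # cache miss: perform the query with retries
        try:
            frame = fetch(query, day)
            break
        except Exception:
            if attempt == retries - 1:
                raise
            time.sleep(2 ** attempt)  # back off, then try again
    path.parent.mkdir(parents=True, exist_ok=True)
    frame.to_parquet(path)            # write it out for the next run
    return frame
```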
A: This populate-cache command... as you remember, we went into this coming from the book, which performs a bunch of queries and at some point hits the query_range_with_cache method. Here we're coming at it from the other side, so we don't start from the book, we start from this script. And here, let's go for... we were talking about saturation components, so here are all components and all saturation components.
A: We've collected all those here, and then we call populate-saturation-component for each of those. All of this here with the futures and so on, the thread pool executor, is a way of doing this concurrently, because most of the time most of the data is cached: of the 180 days, only one day is not cached. So we do this in several threads to just iterate over all this, like spinning wheels, over the 197...
A: ...of data that we're going to load, see if it's all there, and if it's not, populate it. Yeah, and we limit it to five, so that when we do hit the new day we don't have a thundering herd on Thanos. Yes. So here's the populate-saturation-component, and you see here that it takes the same plot_forecast method that we just saw, and it, yeah, does this thing with from and to, which are the same forecasting dates here, to load all of that.
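The concurrency described boils down to something like this; the function names are stand-ins, not Tamland's actual code:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def populate_cache(components, populate_saturation_component):
    # Most of the 180 days is already cached, so the threads mostly spin
    # through cache hits; capping at five workers keeps the one uncached
    # day from turning into a thundering herd against Thanos.
    with ThreadPoolExecutor(max_workers=5) as pool:
        futures = [pool.submit(populate_saturation_component, c) for c in components]
        for future in as_completed(futures):
            future.result()  # surface any failed query
```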
A: So it's going to go through all of this cache manager stuff here and write out the data directory, and it skips generating the forecast, which also takes some time, so it's slightly faster than running the entire book. That's what the historical-only flag here does: it skips calculating the forecast. The forecast, I'll get to it later.
A: That's how the cache gets populated. Any questions here before we go back to the forecasting bit?
A: Then, if we're not coming from the cache thing: this historical-only flag is set to true for populating the cache, and when we're rendering the book it's false. So then we're going to generate a forecast.
A: And this is, yeah, where the smarts happen. So, this Prophet is a little bit of a black box that you can add some parameters to, and that's going to...
A: Yes, so make_future_dataframe: that's going to make the same kind of pandas data frames as we got from the Prometheus queries, so we've got the same kind of things. We could actually write them out the same way if we wanted to, but those might change, obviously, so we don't.
B: I'm probably just missing it, but how do we pass in the input data for Prophet to consume? I see where we're configuring daily and weekly seasonality.
B: Oh, m.fit, maybe? Yes, okay, about five lines down.
A: Cool. And then here in forecast, we basically prepare for rendering it out in a pretty graph; yeah, or pretty, eye of the beholder, and so on. But yeah: the configuration of the axes and the graph lives in here.
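For reference, the general Prophet flow being walked through looks like this; Tamland's actual parameters and forecast horizon differ:

```python
import pandas as pd
from prophet import Prophet

def forecast_saturation(history: pd.DataFrame, horizon_days: int = 90) -> pd.DataFrame:
    """history uses Prophet's expected columns: 'ds' (timestamp) and 'y' (the ratio)."""
    m = Prophet(daily_seasonality=True, weekly_seasonality=True)
    m.fit(history)  # this is where the historical data goes in
    # Extend past the history at the same hourly resolution:
    future = m.make_future_dataframe(periods=horizon_days * 24, freq="h")
    # predict() returns yhat plus the yhat_lower/yhat_upper confidence bands:
    return m.predict(future)
```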
B: Yeah, that's fantastic. So there were a couple of things I wanted to... I guess I'm not sure if I want to chat about this, or brainstorm about it, or just kind of keep an eye out for ways to handle it gracefully.
B: When I manually look at the trending history, even outside of Tamland, just for any kind of time-oriented trending behavior, some common classes of gotchas that I've run into, and I'm sure we've all run into, are having a discrete event, like a change in system behavior where a workload change or an efficiency or inefficiency was introduced, or a defect in the measurement was corrected.
A: We can... well, Prophet supports it. So, I don't know the code by heart, but I've seen it in the documentation that you can do it in this, like here, like where we set the daily seasonality: we can add ranges that we need to ignore. So that's possible, like, ignore or do something else with; it's supported, but we don't have a way of doing it right now.
A: So when we have these things in Tamland and we see them as a human on the graphs, then we currently mark this in the issue, for however long we think: like, if it was a one-week thing, it might take a month or so before the prediction is over it, yeah.
B: I guess I was wondering if... yeah, that makes sense. So I was kind of thinking of this as a twofold topic. One is conveying it: when we do this kind of discovery work, it's nice to be able to pass that along to other humans, like we are right now; or, since we've got a rotation now, having a place to say, hey, for this particular, I guess they're not really alerts, but for this alert here, here's what I found, and this will probably continue to affect the projections.
A: We did database maintenance, or, like, we were close to saturation because this thing got deployed and it burned up quickly; we know, yeah.
A: ...and say: bad, ignore these three days. And then the next prediction will be... like, we'll see the dots of the actual data when we... yeah.
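Prophet's documented way of doing this is to blank out the bad window before fitting: rows whose y is set to None are ignored during the fit but still predicted over. A sketch, assuming the ds/y frame from before:

```python
import pandas as pd

def ignore_range(history: pd.DataFrame, start: str, end: str) -> pd.DataFrame:
    """Blank out a known-bad window (bad deploy, maintenance, measurement bug)."""
    history = history.copy()
    mask = (history["ds"] >= start) & (history["ds"] <= end)
    history.loc[mask, "y"] = None  # Prophet skips NaN rows when fitting
    return history

# e.g. drop three bad days before calling m.fit(history):
# history = ignore_range(history, "2022-09-12", "2022-09-15")
```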
B: So I guess ignoring a particular range lets us handle things like a discrete event that caused an abrupt, limited-duration spike or dip, and...
B: Yeah, exactly. What about cases where, for example, we made a capacity increase which dropped the percentage utilization, or we made an efficiency increase which also drops the percent utilization? Like we have with... like when we added memory to redis-cache, or when we performed tuning, or when we offloaded... you know, reduced the TTL for the runners cache. I'm just thinking of kind of recent events, and we do this in a bunch of different components.
B: You know, largely as incident response, but sometimes just as planned work as well. How do we... do you know of a way to kind of retrain the projections, like, where we can give it a hint that says: this was a discrete event that will have a lasting change?
A: No... we had that before, and we said we don't need this for redis-cache, because it's always at maxmemory and that's what maxmemory is for; and then we decided it's not what maxmemory is for, so we do need that metric, and then we introduced it. But this is the kind of event that Matt is talking about.
A: Like, we saw this coming, so we did the thing, and we had this drop; now be smarter about it, like, it's going to grow at the same rate, but yeah. And I was looking...
A: And the other one, the other Redis... like, if you look at, since it's single-core, single-threaded, I mean, then...
B: You scrolled past... I didn't catch what it was, but just when the screen was refreshing, there was one of the graphs that had an abrupt drop, and it had the confidence bands arc out.
A: Right now we deal with that humanly, like, humans say: no, no, we've got this. It often happens: we alert, we create the issue; when these blue bands hit the red one, then we create a capacity planning issue. Yeah, there's so much stuff that I still haven't shown, like, you don't know how these pages get generated, or, yeah, how these issues get created with capacity planning.
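The "blue bands hit the red line" check amounts to finding the first forecast timestamp where the upper confidence band reaches the saturation threshold; a sketch over Prophet's output columns (the issue-creation side is not shown):

```python
import pandas as pd

def first_threshold_hit(forecast: pd.DataFrame, threshold: float):
    """Return the first timestamp where the top of the blue confidence band
    (yhat_upper) reaches the red threshold line, or None if it never does."""
    hits = forecast.loc[forecast["yhat_upper"] >= threshold, "ds"]
    return hits.min() if not hits.empty else None
```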
C: I have a question that is probably a very simple yes-or-no one at this point in time. At this point in time, we're only focused on capacity planning in the sense of "something is going to get too big", correct? We're not actually looking in any way at "we are wasting resources on this and should use..."; yeah. Okay, no.
A: Yes. For some things, though, the... if... there...
A: Yeah, but if you look at... I can show you one, I think, if I look for trace chunks.
B: Yeah, like, I mean, that being a great example of a case where we have a bunch of CPUs on some of these Redis boxes, right, even though Redis can't use the CPUs. But explicitly, by having a bunch of CPUs, we also...
B: ...get a powerful machine, yeah, and we get prioritized network throughput, because that's also, stupidly, tied to the number of CPUs, I mean, from our perspective.
A: Like, from that perspective, yeah. What we're seeing here on my screen is the single-threaded, so single-core, CPU for redis-tracechunks, and I bet that's exactly what you were talking about, because we provisioned that machine to handle a lot of throughput, like network, yeah, yeah.
B: Exactly. So we can, like, using tracechunks as an example, we can do an assessment... I guess what I'm trying to work towards, this is a topic. So we have, kind of, just thinking of VMs as a collection of a handful of different resource types, right, and the way we're producing these VMs is generally in kind of standard blocks of those resources.
B: So if you want a machine of size X, then you're going to get this much CPU, and this much memory, and this much network, and this much disk. And we don't have to do it that way, but that's...
B: ...that's a common provisioning approach. And I guess what I'm kind of working up towards is: for a given workload, whatever workload means in the context of the service that we're talking about, there are generally going to be...
B: As human analysts, we can identify certain factors that are likely to influence... I guess what I'm getting at is: there's always going to be at least one resource that is the most critical bottleneck for a given workload, and...
B: Exactly, so, exactly. So from my perspective, what I'd ideally like, as a team, for us to be able to work towards, is building up a knowledge base of what the bounding resources are for given workloads and services, and what factors influence changes in that workload. Because changes in the workload drive changes in resource utilization and can shift the bottleneck to some other resource or component, and being aware of that is just solid gold in terms of being able to, you know, do both capacity planning and incident response, or cost reduction, exactly.
B: Yes, exactly. So that's like... you know, I don't know how much of that Tamland can help us with, but in terms of, as a team of human engineers, that's what I would ideally like us to be able to move towards, using whatever the right tools are for that kind of work.
B: But that's kind of my personal ambition for us to be able to accomplish as a team, I mean, but that's just one person's opinion. So I was really curious how you two felt about that, as kind of, you know, something to work towards, and also thoughts on how we can kind of begin to accumulate this kind of information in a structure.
A: How does somebody like, for example, Blake, go past a graph like this and say: well, the CPU is not utilized, why is that? Yes, reason about this.
B: Exactly. I kind of feel like having a set of kind of notes about bounding resources and known factors that influence utilization of those resources could be organized on kind of a per-service basis, and I had, at one time a couple of years ago, thought that we might use the runbooks for that, because we have sort of a similar need in terms of incident response to do service-oriented triage.
B: The runbooks are kind of a mess in terms of the organization, you know, no knocking; there have been several kind of attempts to tidy that up. So I guess what I'm kind of working up to is...
B: I don't really care where you put it, as long as it's accessible to folks and we have enough freedom to organize it. The more we do, the more we'll understand about what structure is useful for our purposes, for doing analysis and forecasting, and I think once we've kind of done a little bit more of this as a group, it'll be more clear...
B: ...what aspects of organizing this information are useful and reusable. In past work, prior to GitLab, we used a wiki for this; we're not big on wikis here, but yeah.
A: In some fashion... what would you like? Because now I'm thinking, like, yeah, starting small: like, everything in the runbooks, and saying, API, like, the web: everything there is mostly memory-bound, right, and we just mark that on the service.
B: Gitaly may be an interesting service because, depending on the workload, we can drive Gitaly to CPU and memory saturation either, you know, at the same time or separately; exactly, exactly, yeah, exactly, like representing a kind of cascading-failure scenario. I'm going to, just to have a concrete example to talk about: now that we've rolled out cgroups, there's a pattern where, I mean, this can happen without cgroups too...
B: ...but it's easier for it to happen, and it's easier to talk about, in the context of a cgroup. So you get one project that has, say... someone's got a fork of the Linux kernel, or a fork of GitLab.
B: Something that's got a lot of git objects in its history, and they do... I shouldn't talk about the specific mechanics of this particular kind of abuse, but say they run some gRPCs that are particularly memory-intensive and require a long-lasting traversal of the object history, and that drives up both CPU and memory utilization.
B: But in this context we're going to imagine that we run out of memory first. This anonymous memory usage depletes the file system cache in whatever scope we're working in; in this case, suppose it's a cgroup. So, just to put some numbers on it, and these numbers are smaller than what's in production: say we've got a 10-gigabyte budget for memory in the cgroup, and normally that's plenty.
B: Normally that's mostly file system cache pages, but when we run this particular workload, each time this gRPC gets called it gobbles up, say, four gigabytes of anonymous memory and holds it for a minute. So if you get two or three of these running concurrently, then you've just kicked out the entire page cache, and at that point any other action, whether it's these, you know, poisonous commands or not, has to do a lot more disk...
B: ...I/O. So at that point we've shifted the constraining resource from memory to disk, to block I/O, for the scope of the project that runs in that cgroup. I'm not sure how we capture that; as humans, we can write some prose that describes this pathology, but I don't know how we capture that in a forecasting framework. Except it does have some interesting properties, like that state transition, where we abruptly switch to having a large increase in block I/O. I guess this particular scenario I'm kind of talking about is more of an incident...
A: As well, we do... like, the way we match these saturation points to services is through tags. Okay, let me show... it's a bad example now, but, like: do we already have saturation metrics for the newly introduced cgroups, and do we need them?
B: For memory in particular, it's normal for the cgroups to have, you know, approximately 100% memory usage, but most of that memory should be file-backed cache pages, not anonymous memory. So that's differentiating between different types of memory usage; cAdvisor doesn't do a fantastic job of advertising the way in which memory is used.
B: So it's kind of like at that host level, where, you know, for Linux, unlike some OSes, Linux prefers to have a very small amount of actually totally unused free memory; it'll use most of its memory for page cache, and it'll give up those pages whenever processes need to allocate anonymous memory.
A: Right, but what I was driving at is that we have a mapping of which services depend on which of these resources that could get saturated, okay, and which applies here. So, for example, here's the CPU saturation one, which is using the metrics catalog to find everything that is provisioned on VMs, and that's based on...
A: So here we've got these tags for some code things, and then we've also got deployment.
A: I think that's also general: like, it's a stanza that's in this thing that says, this is deployed on VMs, this is deployed on Kubernetes.
A: Here... no, that's service dependencies.
B: Oh, okay, so you're thinking about... I think what you're getting at is, in the context of Kubernetes node pools, we may have some heterogeneous workloads where you've got some pods that are doing API work and some pods that are doing...
B: Got it, yes, okay, yeah. I was thinking you were working up towards contention between services that are sharing the resources of a single VM.
A: I think we're far away from that, but that's the dream, right? Like: this Redis thing is not doing quite so much, let's put some CPU-intensive Sidekiq jobs on it. Sounds brilliant, but it's very scary, yeah.
B: Yeah... yes, I'm remembering some bad outcomes from... yes, agreed. Okay.
B: Yeah, so, grouping: having some way to annotate, to, like, you know, Blake as another consumer of this data, to show that these resources are collectively provisioned, and among these five resources, this is the one that's the bounding resource and the other resources kind of come along for the ride, yeah.
B: Yeah. Sometimes it's hard, though; like, sometimes just seeing the graph, and the collection of related graphs, isn't enough. You need to see some more about the reasoning. Like, for example, I think the three of us just had an example of this...
B: ...a couple of days ago, where we were trying to reason about the sizing for the VMs for the registry rate-limiting pods to run on, and we kind of rehashed, in kind of rapid fashion, the series of discoveries from, like, the last couple of years, where we're like: okay, so we know that the copy-on-writes during RDB backups mean that we need to have up to double the maxmemory, and we also have the, you know...
B: More recently, we learned about the Redis replication buffer, times the number of replicas, being a factor for certain workloads; redis-cache in particular is big enough that that was something we had to pay attention to, whereas for the other Redis clusters we didn't have to. So I think having this kind of... I feel a little bad about this, but this makes me feel like we need a place to put some written prose to describe the interactions between these resources and kind of the reasoning behind...
B: ...like, per service: concise descriptions of the reasoning behind the sizing decisions for a given set of resources, whether that set of resources is, like, all of the metrics representing resources for a given VM or type of VM, or something more abstract. Most of the time when we're talking about resources, we're talking about machine resources: CPU, memory, network, disk.
B: Yeah, exactly. So that also, I think, benefits from having a little bit of context around some sizing and scaling choices. Like, I guess I started off thinking about... sorry, I'm not leading up to anything, I'm just kind of talking through it: there are a lot of people that are contributing to decisions about how we're making, you know, tuning and optimization trade-off choices.
B: For the Tamland trending, we're going to kind of rediscover, I think, some of those decisions being made, and having a way, or a place, to kind of, you know... and maybe this starts off as just the Tamland issues themselves. This...
C: ...I've got to drop off and learn about being on call, but I'll watch the last two minutes of this if you keep talking.
A: It's going to be great... like, I've not left you in a pretty good place, but let's prepare next week somewhere, okay? Yeah, on doing a run-through: I'll do a run-through and make sure that the book is ready. But I broke things the past two weeks, so, okay, like, yeah.
A: That's the part that I haven't shown, and I actually wanted to talk with Sean about it: the bits and pieces with pipelines on ops and pipelines on GitLab, and the pages site on GitLab, and, yes, the two different projects, capacity-planning and Tamland. And then the scalability project is adding images to the Tamland issues. So it's a little bit of a... yeah, right.
B: Yes, awesome, yeah. I would love to get a walkthrough. I just barely started to look through some of the issues, and it was kind of clear that there was some context missing, so I thought pairing on it for the first round or so would be super helpful.