From YouTube: SIG Instrumentation - usage-metrics-collector deep dive
B
Recording to the cloud. Welcome, everyone, to today's edition of the usage-metrics-collector subproject meeting, not the regular SIG Instrumentation meeting. It is Wednesday, February 8, 2023, and I believe we're going to kick it off with a very exciting demo, then follow up with a presentation, and we'll have opportunities for people to ask questions. Today is going to be a deep dive on the usage-metrics-collector subproject, so Phil, please take it away with the demo.
C
Sure, I'll just start with the demo here. I'm going to start out with the end result, which in this case is this Prometheus instance. I've got a cluster running over here; it's just a random GKE cluster I created, nothing too special about it, except that it's using cgroups v1. I deployed our application with kubectl apply, port-forwarded, and I've got this Prometheus instance running as part of it, so this should be pretty familiar to everyone here.
C
What's interesting, I think, is that these are some of the new metrics published by the collector, and these are the utilization metrics, which are the more interesting of them, because you don't get quite the same thing when you're using kube-state-metrics or cAdvisor.
So, for instance, we have the max utilization for a workload as one of the metrics here. What this one is: it's at a one-second sampling, so you sample the CPU every second, and in this case you get five minutes' worth of samples, so that's 300 samples.
C
What was the max sample over that five-minute period? So in this case there's the workload name here, and you'll see it has the container, right? In this case there's just one container, and then various other labels associated with this particular workload. It's running as a DaemonSet.
C
This is the name. So across all the containers, or rather all the instances: there are the 300 samples per five minutes, and then there are the three replicas, one for each node, in which case you have 900 or so samples every five minutes.
C
What was the max sample that we saw? Okay, so that's about 0.1 CPU, 0.09 CPU, right? But then you also have these other ones, like: what was the P95 sample of those thousand or so, what was the median of those thousand or so, and what was the average of those thousand or so?
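The window statistics described here (max, P95, median, and average over the buffered one-second samples) amount to something like the following sketch. The function name and the nearest-rank percentile choice are illustrative, not the collector's actual API.

```go
package main

import (
	"fmt"
	"sort"
)

// windowStats computes the aggregates described above over a buffer of
// one-second CPU samples (e.g. 300 samples for a five-minute window).
func windowStats(samples []float64) (max, p95, median, avg float64) {
	sorted := append([]float64(nil), samples...)
	sort.Float64s(sorted)
	n := len(sorted)
	max = sorted[n-1]
	// Nearest-rank percentile; a real implementation may interpolate.
	p95 = sorted[(n*95+99)/100-1]
	median = sorted[n/2]
	var sum float64
	for _, s := range samples {
		sum += s
	}
	avg = sum / float64(n)
	return
}

func main() {
	// 300 samples: mostly idle, with one short spike to 0.09 CPU.
	samples := make([]float64, 300)
	for i := range samples {
		samples[i] = 0.01
	}
	samples[150] = 0.09
	max, p95, median, avg := windowStats(samples)
	fmt.Println(max, p95, median, avg)
}
```

With a single one-second spike, the max reflects it while the P95, median, and average barely move, which is exactly why the different aggregates are published side by side.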
C
So, you know, it automatically does the walking from the pods into the workloads to resolve all the replicas and such. That's, I think, one of the more interesting examples of what this can do. One thing it's doing as well: I just wrote this, and this code isn't checked in yet, but I wrote it just for the demo.
C
You
know,
searching
for
workloads
that
are
maybe
poorly
tuned
or
what
are
your
workloads
that
are
having
the
highest
requests
or
what
are
workloads
that
are
using?
You
know
really
requiring
on
their
limits
a
lot
more
than
than
maybe
you
want,
and
so
in
this
case,
I
just
did
a
pivot
and
so,
and
it
just
has
a
sidecar
that
dumps
that
data
into
bigquery
I
also
have
it
dumping
into.
C
The
cloud
storage
over
here,
so
if
you
want,
for
instance,
to
do
other
sorts
of
analytics
I,
think
a
lot
of
what
we're
trying
to
do
is
making
it
like.
How
do
we
make
it
simple
to
to
do
sophisticated
stuff?
With
this
that
may
be
prom
ql
queries
are
just
you
know
not
well
situated
to
do
oftentimes.
You
know
you
get.
C
If
you
want
to
look
at
maybe
30
days
worth
of
data
prom
doing
something
sophisticated
in
prom
ql
is,
it
starts
to
hit
its
limitations,
and
so
that's
why?
Having
archives
of
this
data
or
or
putting
them
in.
C
Sources
gives
you
capabilities
that
you
wouldn't
otherwise
have
and
then
yeah
that's
that's
kind
of
the
main
bits.
I
can
maybe
just
walk
through
quickly.
C
So we've got a couple of samplers here; these are the things sampling on each of the nodes. Then we've got this collector, which does the aggregation, and our Grafana and Prometheus instances. If you want to just debug and see what's going on, you can, of course, just curl that collector instance and it'll dump all the stuff out there.
B
So
Phil
just
to
avoid
defining
things
by
like
their
name,
so
the
the
node
sampler
is
the
thing
on
each
of
the
nodes
that
is
looking
at
all
of
the
workloads
and
collecting
utilization
metrics.
And
then
the
collector
is
the
thing
scraping.
Those
Samplers
and
doing
cool.
B
I,
don't
know
if
that
that
helps
anyone,
but
I
heard
like
the
sampler
is
sampling
and
I'm.
Like
yes,
I
know
exactly
what
that
means.
You
might.
B
Let me quickly just read the question in chat: how is metrics data exported into BigQuery? Does it upload a metrics data file into BigQuery?
C
Yeah-
and
that
was
done
like
as
as
like
what
would
be
cool
to
do
as
part
of
the
demo
right
and
and
so
I
wrote
a
script
that
just
uses
the
go:
laying
bigquery
there's
a
load
API
that
loads
Json
and
the
trick
there
was
that
you
have
to
update
the
schema.
So
one
thing
with
bigquery
is,
if
you
add
a
label
right
to
your
metrics,
and
then
you
try
and
push
that
into
bigquery.
It's
gonna
this.
The
query
has
to
have
those
additional
labels,
and
so
it
basically
just
reads
the
so.
C
The
collector
dumps
there's
an
option
in
the
config
for
The
Collector
to
dump
out
a
Json
file
each
time
it
does
a
collection
right
into
that.
Json
file
is
actually
a
format
that.
C
Can
read:
Norm
natively,
it's
just
new
line,
delimited
Json
Blobs
of
the
of
the
metrics,
and
so
what
we
do
is
just
or
what
that
that
sidecar
does
is
just
reads
those
files.
You
know,
com
list,
the
directory
every
minute
or
so
reads
those
files
looks
at
what
all
the
labels
are
infers
the
schema
and
then
sends
a
load
command
to
bigquery
with
that
schema,
and
then
it
will
automatically.
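As described, the sidecar's key step is inferring the schema from the newline-delimited JSON dump before issuing the load job. The BigQuery call itself needs the cloud client library, but the schema-inference step can be sketched with the standard library alone; the function and field names here are illustrative, not the sidecar's actual code.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"sort"
	"strings"
)

// inferFields scans newline-delimited JSON metric rows and returns the
// union of keys seen, i.e. the columns a load job's schema must contain.
// A new label on the metrics shows up here as a new column.
func inferFields(ndjson string) []string {
	seen := map[string]bool{}
	sc := bufio.NewScanner(strings.NewReader(ndjson))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" {
			continue
		}
		var row map[string]any
		if err := json.Unmarshal([]byte(line), &row); err != nil {
			continue // skip malformed lines
		}
		for k := range row {
			seen[k] = true
		}
	}
	fields := make([]string, 0, len(seen))
	for k := range seen {
		fields = append(fields, k)
	}
	sort.Strings(fields)
	return fields
}

func main() {
	dump := `{"workload":"ds1","container":"app","value":0.09}
{"workload":"ds1","container":"app","node_pool":"default","value":0.01}`
	fmt.Println(inferFields(dump))
}
```

The second row introduces a `node_pool` label, so the inferred field list grows accordingly, which is the schema-update behavior described above.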
E
Yeah, I was just wondering what kind of cardinality you have tested, because workload names are unbounded and you can theoretically have huge clusters, right, with many...
C
So,
where
we're
like,
if
you,
if
you
look
at
what
we're
trying
to
replace
what
we're
trying
to
replace,
is
doing
a
c
advisor
lookup,
all
the
pods
joining
the
utilization
of
that
with
the
Pod
info
with
KSM,
then
with
the
Pod
labels
from
KSM,
then
with
the
node
info
from
KSM,
then
with
the
node
labels
from
KSM
right
and
the
namespace
labels
from
KSM
right
and
you're,
basically
doing
a
massive
join
of
all
the
pods.
C
You
know,
and
so,
if
you
look
at
I'd,
say
we're
looking
at
more
at
the
cardinality
that
we're
reducing
right,
which
is
you
know,
orders
of
magnitude,
Maybe
100x,.
C
I'm
gonna
I'm
gonna
refrain
from
from
saying
the
exact
cardinality
just
because
I
can't
remember
what
we've
we're
pretty
clear
to
talk
about
in
terms
of
what
we're
doing.
B
A
question
in
chat,
which
I
think
was
just
kind
of
covered.
Maybe
this
will
be
covered
later,
but
to
what
degree
is
this
intended
to
replace
the
advisor
metric
collection?
It
doesn't
need
to
entirely
replace
it,
but
it
can
be
used
as
an
alternative
mechanism
to
see
advisor.
It
looks
at
the
same
things
that
c
advisor
does,
but
is
a
little
bit
less
CPU
intensive,
and
only
because
it
only
looks
at
a
small
subset
of
what
C
advisor
can
collect
it's
just
looking
at
CPU
and
memory.
C
It's meant to replace the thing people really hate doing with cAdvisor, which is those really complicated joins across really high-cardinality data, to get something like the workload metric I demonstrated. How do you figure out the workload? Well, now you're joining with the owner reference, but the owner reference from KSM isn't enough; walking from pod to ReplicaSet to Deployment, and maybe to something higher, is actually a really, really complicated query to do. So I'd say the scope is limited to replacing the pieces that cAdvisor really isn't trying to do. Would I use this to replace the one thing cAdvisor does, where I want to debug a pod, because there's one pod that has too much CPU, just one pod, not the workload, not broader patterns?
C
We
we
categorically
have
said,
that's
really
not
what
we're
trying
to
do,
and
so
for
for
that
use
case.
We
we
I've
advised
people
yeah
continue
to
use
the
advisor
for
that.
Don't
come
looking
at
our
things,
for
it.
B
So
we
do
have
one
more
question
in
the
chat,
but
I
think
that
one
might
get
answered
in
the
presentation.
So
why
don't
we
go
and
do
the
presentation
and
then
we
can
come
back
to
q.
A
I
also
wanted
to.
Let
folks
know
like
I'm,
happy
to
read
your
questions
and
chat,
but
also
like
feel
free
to
put
your
hand
up,
and
you
can
unmute
and
ask
the
question
as
well,
but
without
further
Ado
About
You
want
to
take
us
away
with
your
presentation.
B
D
Sweet. So, just a little bit about the design of the collector: the main piece is the actual collector itself. Let me find my mouse here; I can't find it, oops.
D
Hi
this
is
my
mouse
disappear,
oops,
well,
I
guess:
I
can
go
into
presentation
mode,
because
I
can't
see
my
mouse
anyways,
the
so
the
main
pieces
of
The
Collector
itself
it
takes
in
a
config
and
I'm
gonna,
develop
a
lot
deeper
into
the
config,
because
it's
how
it
sets
up
all
the
aggregations
that
Phil
showed
so,
for
example,
you
would
set
the
resources,
the
extension
labels,
the
aggregations
themselves,
the
local
samples
that
Phil
was
pushing
to
bigquery,
sidecar,
metrics
and
so
on.
D
And
the
second
piece
is
the
node
sampler:
that's
what
actually
gets
utilization
metrics
from
the
nodes,
be
it
using
container
D
or
crawling
the
C
group,
the
C
group
slot
system,
and
also
you
have
a
controller
piece.
That's
what
gets
your
resources
from
the
API
server
to
get,
maybe
requests
allocated
or
your
limits
and
quota
and
so
on,
and
this
collectors
also
a
Prometheus
exporter,
because
you
you
want
to
expose
all
the
metrics
you
aggregated
via
a
metrics
endpoint.
D
The sampler pushes metrics to the collector, so you have metrics like CPU, memory, CPU throttling, and OOM kills as well. The first set of metrics it pushes would be the container metrics. You can get those by walking the sysfs cgroups, but it also supports getting metrics through the containerd socket, for pod runtimes where you can't get them by walking through the cgroups.
D
So
then
you
also
have
node
metrics
by
c
groups,
so
you
can
kind
of
glob
at
different
levels
and
get
metrics
for
those
as
well,
and
this
is
also
this
is
a
config.
You
can
pass
this.
The
Globs
by
config
I'll
mention
that
a
little
bit
in
the
next
slide.
So
this
takes
some
configurations.
The
buffer
size,
for
example,
Phil,
was
showing
one
second
sampling
and
buffer
size
of
five
minutes.
So
300
samples,
that's
also
configurable.
D
How
often
you
push
to
The
Collector,
it's
configurable
as
well,
and
also
the
no
double
Globs
I
mentioned
in
the
previous
point,
all
right,
so
the
metrics
collector
itself.
So
we
have
that
config
I
talked
about
here.
You
would
specify
what
resources
you're
interested
in
what
are
the
sources
of
the
metrics,
the
aggregations
and
the
extensions
I'll
go
deeper
into
that
in
the
in
a
slight
ahead.
D
It's
also
the
aggregation
engine
that
actually
performs
the
aggregations
and
exposes
those
metrics
as
well,
and
then
you
have
the
controller
that
would
list
the
resource
quota
to
get
other
sources
from
that
and
pods
nodes,
bbcs
or
namespaces.
D
And
then
you
have
the
exported
piece
that
actually
exports
exposes
the
metrics
endpoint
that
you
could
scrape
all
right.
So
the
config,
where
all
the
magic
happens.
D
So
the
first
thing
in
the
config
is
the
actual
resources
you're
interested
in.
So,
for
example,
you
have
CPU
memory,
storage
and
there's
a
resource
type
called
items,
for
example,
if
I'm
interested
in
it
interested
in
some
a
metric
on
maybe
an
IP
class
on
a
pod
and
I
want
to
count
that
on
the
namespace.
So
how
many
IP
classes
are
being
used
in
some
namespace
I
can
expose
this
as
an
item.
It's
just
essentially
a
count
all
right,
so
you
also
have
extension
labels.
D
So
these
are
labels
that
are
not
built
in,
for
example,
container
name,
pod
name,
those
are
built-in
labels,
but
what,
if
I
want
a
label
that
is
Maybe
a
Prometheus
label
that
is
an
annotation
and
a
label
on
a
pod
or
a
namespace
or
a
node,
or
even
a
note
taint
so
for
a
container
running
on
a
node
that
has
a
taint
of
no
schedule.
D
I
want
to
have
that
label
in
that
container
metric,
so
I
would
use
these
extension
labels
to
get
that,
and
then
you
have
aggregations
which
Define
the
actual
sources,
where
you
get
the
metrics
and
how
they
get
aggregated
and
how
the
metric
that
actually
gets
emitted
for
those,
and
then
you
also
have
something
called
sidecar
metrics.
So
if
you
have
metrics,
you
want
to
expose
that
are
not
part
of
the
collector.
You
can
also
use
a
sidecar
to
write
those
metrics
to
disk
and
rotate
them,
and
then
you
just
pass
the
folder
to
The.
D
The collector will then expose those metrics as well. And then there's the part that relates to the BigQuery piece, or GCS or whatever object store you want: you can also save local samples of the metrics to disk, in either proto form or JSON.
D
Okay,
that's
it
about
config.
Let's
talk
a
little
bit
about
aggregation,
so
sources,
so
your
sources
are
just
where
your
metrics
are
actually
coming
from.
So
you
have
different
types
here.
You
have
containers
C
group
spot
no
quota
and
for
each
type
you
have
different
sources
in
that.
So,
for
example,
here
for
quota
I
might
be
interested
in
resource
code
of
heart
or
quota
use
for
container
I
might
be
interested
in
the
allocations.
Or
limits
all
right,
so
then
you
have
for
the
aggregations.
You
also
oops
oops.
What
happened
there?
D
So a mask is just a set of labels and a level name that determines what metrics are going to get aggregated for those label values. For that you have the built-in labels, and then this is where you pull in those extension labels that I mentioned before. Then you have an operation: you have these masks that specify what labels you want to aggregate over, and you have the actual operation that's going to get applied to the samples that get aggregated. Average, sum, median, mean, and P95 are currently supported.
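Conceptually, a mask plus an operation amounts to grouping samples by the retained label values and folding each group with the operation. A rough stand-alone sketch, not the collector's actual types:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// A sample carries a value plus free-form labels.
type sample struct {
	labels map[string]string
	value  float64
}

// aggregate groups samples by the label values named in mask and sums
// each group; other operations (avg, median, P95) would fold the same
// groups differently.
func aggregate(samples []sample, mask []string) map[string]float64 {
	out := map[string]float64{}
	for _, s := range samples {
		parts := make([]string, len(mask))
		for i, l := range mask {
			parts[i] = l + "=" + s.labels[l]
		}
		out[strings.Join(parts, ",")] += s.value
	}
	return out
}

func main() {
	samples := []sample{
		{map[string]string{"pod": "p1", "container": "app"}, 0.05},
		{map[string]string{"pod": "p1", "container": "sidecar"}, 0.02},
		{map[string]string{"pod": "p2", "container": "app"}, 0.04},
	}
	// Pod-level mask: the container mask minus the container name, so the
	// two containers of p1 collapse into one series.
	agg := aggregate(samples, []string{"pod"})
	keys := make([]string, 0, len(agg))
	for k := range agg {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Printf("%s %.2f\n", k, agg[k])
	}
}
```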
D
You
can
also
do
histograms
and
for
histograms
you
have
to
specify
kind
of
buckets
so
that
these
your
samples
can
be
correctly
aggregated
into
these
histogram
packets
and
there's
a
no
export
this
controls.
If
that
level
is
actually
exposed
asymmetric
as
Phil
alluded
to.
Maybe
you
don't
want
to
expose
container
or
pod
level
metrics,
but
you
want
to
do
those
aggregations,
so
you
would
just
specify
the
aggregation
and
then
add
the
no
export
true
and
that
does
not
get
exported.
D
D
All right, I have an example here. Here's an example of an aggregation on a container type, and I am interested in utilization and requests allocated for that container. For my masks, I want my first aggregation to be on the container. Here are the built-in labels that I'm interested in: maybe the container name, namespace, the pod name, priority class, and the node that that container is running on. Then for extensions:
D
Maybe
I
want
to
know
about
the
the
the
the
taints
on
the
note
that
that
container
is
running
on.
Maybe
if
it's
no
schedule
I
want
to
know
about
that,
and
also,
if
I
have
an
annotation
on
the
Node
that
specifies
a
node
pool.
I
also
want
to
include
that
here
and
then
I
have
no
export.
True
I
mean
at
operation
is
average
by
the
way.
D
So
it
averages
all
the
samples
that
match
that
matches
the
label
values
and
then
I
don't
want
to
export
this
metric
and
then
the
next
aggregation
level
will
be
the
Pod.
It's
just
the
same
thing.
It's
just
missing
that
container
name
and
then
the
operation
here
is
the
sum
so
I
want
to
sum
all
the
containers
that
belong
to
that
pot.
D
The
mask
name
I
had
here
so
just
say
spot,
and
this
is
that
operation
I
use
so
sum-
and
this
is
the
actual
Source
itself
so
regress
allocated,
and
this
is
the
resource
I'm
measuring
so
that'll
be
CPU
course
and
those
other
labels
that
are
pulled
from
the
built-in
labels
and
the
extension
labels.
The
same
thing
for
utilization
as
well,
and
that's
it
any
questions.
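Putting the example together, a config along these lines would express the two-level aggregation just described. This is paraphrased from the talk, so the exact field names may differ from the project's actual schema; the label key under `nodeLabels` is a made-up placeholder.

```yaml
# Illustrative only; check the project's docs for the exact schema.
resources:
  cpu: cpu_cores
extensions:
  nodeTaints:
    - name: node_taint          # emitted metric label
      key: dedicated            # hypothetical taint key
  nodeLabels:
    - name: node_pool
      label: example.com/node-pool   # hypothetical node label key
aggregations:
  - sources:
      type: container
      container: [utilization, requests_allocated]
    levels:
      - mask:
          name: container
          builtIn: [container, pod, namespace, priority_class, node]
          extensions: [node_taint, node_pool]
        operation: avg          # average the samples per container
        noExport: true          # aggregate, but don't emit this level
      - mask:
          name: pod             # same mask minus the container name
          builtIn: [pod, namespace, priority_class, node]
          extensions: [node_taint, node_pool]
        operation: sum          # sum the containers belonging to the pod
```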
B
Okay,
so
just
coming
back
quickly
to
how
use
question,
did
the
presentation
answer
your
question.
B
And
Ty
I
think:
did
you
want
to
unmute
and
ask
your
question.
A
Sure
I'm
just
wondering
a
bit
more
about
the
extensions.
My
the
current
way,
I'm
I'm
understanding
it
which
might
be
wrong,
is
you
know
you
mentioned
like
labels
and
annotations
can
be
looked
up
so
under
extensions.
Whatever
key
you
specify,
there
is
going
to
be
it's
gonna
be
looking
for
a
matching
key
in
annotations
or
labels
is.
Is
that
correct
and
then
I'm
also
wondering
like
if
you
would
have
an
annotation
and
a
label
with
the
same
key
is?
Is
there
an
order
to
kind
of
which
gets
checked?
First.
C
These are what get produced by these sorts of things, and you can see them building on one another here, and some of these are histograms and whatnot. If you look up at the top, this is where the extensions are. These are all commented out; I don't have any extensions enabled, but let's say I wanted to.
C
So
that's
like
you
know
the
Pod
metrics,
if
you
have
them,
but
like
workload,
metrics
app,
metrics,
namespace
metrics,
like
for
the
sum
of
utilization
of
a
namespace
clusterometrics,
like
all
these
sorts
of
things,
I
would
say:
okay
whenever
I
am
building
that
metric
go
look
for
this
annotation
or
this
label
on
the
pot
right,
and
so,
if
it
says
annotation
here,
I'm
looking
for
an
annotation,
if
it
says
label
here,
I'm
looking
for
a
label
on
the
Pod
right
and
then
I
say:
okay
create
the
label
on
the
metric
right.
C
So
not
the
label
on
the
Pod,
but
like
the
metric
label
named
this
right
and
then
copy
the
value
from
the
either
pod
label
orientation
label.
If
it
exists
into
this
new
label
right
her
namespace
labels,
it
would
look
like
this,
and
so
this
applies
to
any
namespaced
object
right.
So
if
I
apply,
let's
say
I
put
in
a
label
on
a
namespace
that
calls
it.
You
know,
project
cool
or
something
like
that.
C
Right
and
I
want
to
track
all
project
cool
namespaces
together
in
some
way
right
then
I'd,
say
Okay,
go
find
the
label
on
a
namespace
project
cool
and
when
I'm
doing
pod
level,
metrics
I
look
up
the
namespace
of
the
Pod.
So
we
do
the
join
here
right
in
memory
through
kind
of
the
informers
cast,
get
the
Pod
get
its
namespace
look
for
the
namespace
label.
Does
it
have
project
cool
if
it
does
copy
that
over
to
this
label
same
thing
with
node
labels
right
so
get
the
Pod
get
the
nodes
on?
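The in-memory join just described (pod, then its namespace in the informer cache, then the configured label copied onto the metric) can be sketched with plain maps; the cache here is a stand-in for the real informers, and the names are illustrative.

```go
package main

import "fmt"

// extensionLabel copies a label from a pod's namespace onto the metric's
// label set, under a new metric-label name. nsLabels stands in for the
// informer cache of namespace objects.
func extensionLabel(metricLabels map[string]string, podNamespace, srcLabel, dstLabel string,
	nsLabels map[string]map[string]string) {
	if v, ok := nsLabels[podNamespace][srcLabel]; ok {
		metricLabels[dstLabel] = v
	}
}

func main() {
	// Namespace "team-a" carries the project label we want to propagate.
	nsLabels := map[string]map[string]string{
		"team-a": {"project": "cool"},
	}
	m := map[string]string{"pod": "p1", "namespace": "team-a"}
	// Join: pod -> namespace -> label "project", emitted as metric label "project".
	extensionLabel(m, "team-a", "project", "project", nsLabels)
	fmt.Println(m["project"])
}
```

The same shape works for node labels and, with a list lookup instead of a map lookup, for node taints.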
C
Maybe
I
want
to
find
certain
nodes.
I
have
different
node
pools
in
my
cluster
and
so
I
want
to.
You
know,
get
metrics
you
know
oriented
around
which
node
pools
are
full
versus
empty
or
something
like
that
or
what's
the
utilization
of
different
node
pools
or
which
node
pools
tend
to
be
high,
more
highly,
have
higher
requests
or
those
sorts
of
things
I'd
go
get
the
Pod
get
the
note.
It's
on
look
for
the
label,
pull
its
label
and
then,
when
I
aggregate
those
pod
level
metrics
up
into
other
dimensions.
C
I
have
these
labels
here
right
and
same
thing
with
the
note
teams,
the
chains
are
done
a
little
differently
because
taints
are
repeated
and
not
just
a
map,
but
you
can
do
the
same
thing.
Looking
for
taints
those
labels,
then
I,
don't
think
I
have
any
examples
here,
but
you
could
see
a
built
in
when
you
want
to
use
them.
There'd
be
something
like
this.
They
appear
in
here
right
and
so
say:
I
want
to
aggregate
doing
a
sum.
C
This
is
probably
the
wrong
level
but
say
you
know
here:
I
want
to
aggregate
on
one
of
those
labels
right,
project,
cool,
true
right
and
so
by
setting
project
cool
equals.
True
here,
when
I
do
the
sum
I'm
going
to
retain
that
label
right
and
so
I'll
now
have
two
distinct
groups
potentially
for
this
workload
portion
of
the
workload
that's
project
cool
in
the
portion
of
the
workload.
C
That's
not
now
for
this
workload
since
they're
all
pods
in
the
same
workload,
they're,
probably
all
going
to
have
the
same
value
for
project
cool,
but
maybe
for
the
you
know,
node
pool
right.
They
might
not
right
and
so
I
have
half
my
workload
on
a
node
pool.
You
know
experimental
and
half
my
you
know,
or
next
version.
C
Maybe
I
want
to
know
what
version
of
the
node
kubernetes
couplet
right,
for
instance,
they're
running
on
and
partition
my
data
that
way
so
this
is
this
is
like
the
extensibility
that,
like
we
know
that
we
haven't
thought
of
all
the
ways
people
are
going
to
need
to
partition
and
annotate
their
data,
and
so
that's
why
all
these
are
built
into
the
rules
of.
Do
you
want
to
do
some?
Do
you
want
to
do
average?
How
do
you
want
to
kind
of
build
up
your
data?
A
B
Great demo, great presentation. It doesn't look like we have any more questions in the chat. I know we promised on our agenda for today, which I didn't link in the chat but will do right now, a technical deep dive and Q&A; we've done both of those things, but also a roadmap. Do we want to talk a little bit about the roadmap?
F
We didn't prepare something; just like in the cartoons, we painted it internally on a boulder, and the plan is that the Road Runner is going to think that it's a real tunnel.
C
I'd
say
I
mean
there's
some
things
I
think
maturity.
The
big
one
is
going
to
be
around
just
maturity.
How
do
we
make
sure
that
that
that
we
don't
lose
data?
How
do
we
keep
the
collector
as
resilient
as
possible?
I
know
that
AJ
has
been
a
challenge
to
get
right
and
just
looking
at
different
different
factors.
There
I
think
the
container
d
is
support
versus
the
c
groups
path
walking.
C
We want to build more support for that. I mean, it's out there and it's usable; I think it's ready for use, but it doesn't have quite as many miles on it. I think a big one is cgroups v2: we currently just walk the cgroups v1 paths. Maybe for containerd it doesn't matter; if you're using containerd, it doesn't matter whether you're using cgroups v1 or v2, but it does for the node-level metrics.
Yeah, an example: you might be wondering why you would want to do that, and you don't actually have to do it through this collector; you could spin up your own service that exports whatever metrics you want. But one pattern we found is maybe metadata about the cluster itself that you don't get from the cluster API, for instance, or just other sources of information that you really want exported.
C
At
the
same
time,
you
want
to
make
sure
that
you,
it's
not
another
separate
service
to
monitor
like
one
service
is
down
and
the
other,
isn't
you
start
not
having
all
the
data
so
the
we
found
the
transactional
nature
of
being
like
I,
gotta
scrape.
Then
I
probably
have
all
the
data
at
that
scrape
at
that
point
in
time
versus
versus
a
distributed,
scraping
a
bunch
of
services,
and
maybe
at
a
moment
in
time
you
don't
have
data
from
one
one
particular
thing,
but
you
do
for
another.
Having
that
is.
Is
nice!
G
So
how
does
it
behave
right
now,
if,
like
the
Informer
connection,
is
not
ready
or
the
watch
is
stuck
or
reconnects?
Is
there
no
metrics,
then,
for
this
period
of
time,.
C
I think it will potentially read from the cache as well; it depends on how stale it is. In practice I think that's been less of a problem; it seems to re-establish itself relatively successfully. It's actually the leases: the big issue we found is losing the lease connection, and then that kills you; you're like, I guess I'm not the leader anymore, and then you die, but maybe the other replica is also not able to get a lease connection. And so you were fine, your informers were fine, but because you couldn't get a lease, you decided that you're going to be unhealthy.
C
Actually, that's not quite true: we tend to run it as one instance, because of what I just described; we found more issues with not getting the lease than with not getting the informers. You have to really know what you're doing to run it in HA, because you have all these node samplers,
C
This DaemonSet of one per node is pushing into the centralized collector, and what you want to have happen is that the Prometheus instance, for instance, only scrapes the collector that's getting the metrics from those nodes; if it starts scraping the other one, all of a sudden you're going to have this weird flapping of utilization to zero.
C
If
you
look
at
the
can,
the
config
has
like
all
these
Readiness
and
health
checks
in
it,
and
we
make
sure
that
when
you
run
an
mode
The
Collector
that
isn't
getting
the
node
samples
is
marks.
Itself
is
not
ready,
and
so
that
way
it
doesn't
get
any
collections.
But
it's
it's
complicated
to
do
are.
G
You
like
right
now,
you
can
only
run
as
AJ.
You
can't
like
Shard
input
from
different
nodes
to
separate
collectors.
C
I
think
like
it's
possible
to
do
something
like
that.
The
sharp,
because,
like
that,
the
way
the
aggregation
works
right,
like
a
big
portion
of
this,
is
doing
all
the
joins
right
and
doing
all
the
aggregation
and
that's
possible
because
you
have
like
one
unified
view
of
everything,
and
so,
if
you're
sharding
it,
for
instance,
you
couldn't
have
the
aggregation
run
across
you
now
you
have
to
do
either
a
hierarchy.
Well,
you
have
partial
aggregations
and
then
filtering
up
or
so
the
answer
is
no.
G
Would
it
make
sense
for
the
all
the
cube,
API
data
you
get
from
the
informal
reflector
that
is
like
pods
and
yeah,
actually
pods
and
nodes
to
collect
that
data
inside
the
node
worker?
Because
you
then
have
at
least
not
sure
you
don't
have
one
collector
that
needs
to
get
all
pods
from
a
single
cluster
because
it
can
get
quite
big.
C
I
mean
we
still
need
to.
It
depends
on
what
the
goals
I'd
say.
This
has
not
been
a
problem
for
us,
I'd,
say
Prometheus
like
it's
actually
really
efficient,
so
it's
surprisingly
efficient
and
Prometheus
is
has
a
much
harder
time
with
the
cardinality
right,
because
it's
only
storing
the
data
right
now
right
and
then
Prometheus
stores.
If
you
wanted
to
store
30
days,
okay,
well,
that's
actually
a
lot
being
a
lot
more
data.
So
for
our
Focus!
That's
that's
the
bottleneck.
C
If
you
wanted
to
Shard
it
like
again,
the
the
when
you
do
the
roll-up
from
pod
to
workload
like
how
do
you
make
sure
that
you
can
whatever's
doing
that,
roll
up
has
all
the
pods
in
that
workload
right
and
then,
let's
say,
you're
also
doing
a
separate
roll-up
from
like
pods
to
node
pools,
okay,
well,
that
one
has
to
have
all
the
node
pools.
Well,
what?
C
If,
like
the
intersection
of
having
all
the
Pod
the
workloads
and
having
all
the
node
pools
right
and
then
and
then
oftentimes,
we
really
want
to
roll
up
to
the
full
cluster
level.
How
what
is
you
know?
How
allocated
is
this
cluster?
Is
it
full?
Is
there
a
lot
of
emptiness
right,
and
so
something
has
to
have
all
that
so
you'd
have
to
do
tiering.
G
Yeah, and the reason I ask is because we've seen metrics-server, which is also this one single thing, start to struggle at certain cluster sizes, with a lot of mostly memory utilization, to a point where sometimes it doesn't fit on a node anymore, and I wonder if that might be a similar problem. But I think, with the aggregation you're doing, it's probably a lot cheaper, because you don't have to hold that many massive objects in memory.
C
Yeah
and
we
can
do
there's
actually
some
off.
We
have
a
lot
of
room
for
optimization,
because
right
now,
the
way
we
do
the
listing
the
way
the
Pod
list
API
for
controller
runtime
works.
Is
it
basically
copies
the
entire
pod
memory
right,
and
so
we
could
probably
cut
our
memory
in
half
if
we
really
needed
to
by
by
just
reading
directly
out
of
the
Informer
cache
without
making
copies
of
the
objects.
For
instance,.
G
Yeah,
so
you
you
might
would
help
in
a
lot
of
also
operators.
I've
worked
on
like
right
now
is
switching
away
from
Informer.
If
you
don't
need
the
whole
object
and
use
reflect
and
have
your
own
cache
or
there's
the
transforming
Informer.
C
One thing we've done: I put in an optimization in the reflector, a transformer to just start clearing parts of the object, and we turned that on thinking it was going to be awesome, we were going to save half the memory, and then it saved like two percent; we were like, just turn it off, who cares.
G
And one last question: the node part is pushing metrics in proto, right? So in theory, if someone would come around and say, hey, I want those in a different format, I want you to do the aggregation somewhere else, they could just implement the same API and tell all the nodes to push there?
C
In theory, yes, but that's not the way I'd start out, because I'd ask: do you really want to push 300 samples, that level of granularity? And the answer is almost certainly no, so then you're doing the aggregations anyway. I'd say maybe the more interesting thing I could see you doing is wanting less granularity, because five minutes is too much, and so maybe you do something where you're doing averages over 20 minutes, maxes over 20 minutes, those sorts of things.
C
One
one
thing
I
didn't
demo
that
we
also
found
was
really
interesting,
is
just
getting
an
idea
of
holistically
like
on
the
Node.
How
much
is
coupons
taking
versus
the
systems
lives?
Are
these
things
well
tuned,
seeing
like
when
you
run
on
large
clusters,
just
the
variance
of
like
turns
out.
It's
not
like
one
size
fits
all,
isn't
perfect
for
the
system,
slice
right
and
you're,
either
under
under
provisioned
or
over
provisioned,
but
almost
no
there's.
B
We've got 12 minutes left, just a time check, so we do actually have a decent amount of time. Feel free to keep the questions coming at us, or, Phil, I don't know if there's anything else you wanted to show.
C
My
favorite
part
of
this
project
has
nothing
to
do
with
the
project,
we're
seeing
instrumentation
or
anything
else.
That's
a
little
piece
of
infrastructure.
We
just
had
to
write
because
we
really
were
tired
of
writing
tests,
and
so
we
write
our
own
test
infrastructure
that
Paul,
Paul
and
I
have
really
patted
ours
and
enable
have
all
patted
ourselves
in
the
back
about.
C
I,
don't
know
I
know
if
people
are
interested
in
seeing
that
something
I
could
show
off,
and
that's
all
right.
So
the
theory
is
I
hate
writing
tests.
C
What
I
really
want
to
do
is
run
my
code
and
then
say:
what's
the
results
of
the
code
and
are
they
what
I
expect
right
and
I
don't
want
to
write
what
the
results
are
and
I
don't
want
to
write
the
test,
and
the
only
thing
I
want
to
write
is
the
input,
and
so
you
know
based
on
that
kind
of
theory,
we
said:
okay,
what
can
we
do
to
make
that
a
reality?
C
And
so
we
wrote
most
of
our
most
of
our
tests
are
just
functional
where
we
have
in
the
collector
here
this.
You
know
these
are
all
directories
and
you
just
create
a
directory
of
test
data,
and
so
we
say:
okay,
here's
a
test
case,
and
so
for
this
we
say:
here's
the
state
I
want
the
cluster
to
have
so
just
load
up
the
cluster
with
this
state.
C
At
this
this
replica
sets
and
then
our
infrastructure
looks
and
finds
like
something
called
input,
client
objects,
yaml
and
just
says:
yeah.
Okay,
I'll
create
the
in-memory
cluster
with
that
and
then
here's
the
spec
for
configuring
it.
So
here's
the
aggregation
rules
and
all
the
various
pieces,
and
then
maybe
here's
this
extra
one.
You
know
other
pieces
of
input
that
you
might
need
to
say.
C
Okay,
like
ignore
these
metrics,
because
you
know
latency
seconds,
obviously
changes
and
it's
not
going
to
be
stable
and
then
here's,
maybe
the
here's,
maybe
the
the
inputs
to
write
for
you
know
what
you
get
from
the
C
group
file
system
right
and
so
each
one
of
these
test
cases
is
just
like
here's.
The
set
of
input
state
that
I
expect
and
then
the
test
basically
just
runs
the
code.
C
You say: okay, go read some of these files, run the code, do a little bit of setup, and then it produces this expected file, which is the output from it, and it writes it. So I don't write this expected file. What I do is just run it and say: update the expected file, or fail if it doesn't match the expected file. That way, whenever we make changes to our code (let's say we want to add a label)...
C
You know, we're changing the expectations, we want to make this better, or we found a bug. Oftentimes we'll find a bug, right, and instead of going and having to fix every test case when we fix that bug, we just say: update this. You run it in update mode and then you just do a diff, right. And so it makes things really easy, where I just fixed the bug and I say: I expect all these files to have changed, show me they've changed. And so we just kind of stamped these things out with different...
C
You
know
it
is
a
lot
of
copy
and
paste,
but
it's
actually
you
know
copying
pasting
in
this
format.
I
think
has
worked
out
really
well
for
us,
where
we
just
kind
of
stamped
these
different
test
cases
out.
We
have
a
lot
of
them
because
they're
really
easy
to
write
you
just
kind
of
copy
it.
C
One
test
case
change
the
inputs
to
be
kind
of
what
you
expect
and
so
that,
for
instance,
did
the
test
for
these
things
and
they're,
not
they're,
not
free,
but
you
can
kind
of
see
they're,
not
they're,
not
terrible,
right.
C
So it works there, right. And so then, you can imagine, I changed the code in a way that doesn't match the test file. We...
C
Cowbell, right. And so maybe this isn't the greatest. I don't know, I don't love this. "Unable to save"; this is not what I expected. Oh, here it is, right. So then you can see. Is this the best thing in the world? It's not. Let me show you how we expect the workflow to really work. So I've added everything, right, so it thinks this is the way, and then, let's see, update... so maybe I'll do something like this.
C
So
pretend
that
the
test
the
test
case
doesn't
match,
doesn't
match
anymore
right,
I've
made
a
code
change
and
now
it
doesn't
match.
What's
going
to
be
produced,
I
can
say
like
just
update
this
stuff
and
then
I
run
it
it's
going
to
run
all
the
tests
and
then,
when
I
go
into
Source
control.
C
It's
interesting,
it'll
show
you
hey,
look
I
made
this
change
like
it
did
say
before
the
expected
result
was
you
know
app
to,
and
now
the
expected
result
is
F1,
and
so
typically,
when
we
write
a
test
case,
we
leave
these
things
empty
kind
of
a
bad
practice.
We
should
probably
we
should
probably
be
a
little
more
careful
about
making
sure
the
test
expects,
expect
expectations.
C
Yeah
I
mean,
and
actually
you
can
see
the
deaths
really
easy.
Then
you
copy
a
test
case.
You
change
an
input,
you,
you
add
all
the
stuff
to
get
run.
The
test
do
the
diff
and
then
you
can
see
yeah
when
I
test
it.
When
I
changed
this
input,
I
see
the
output
changes
in
a
corresponding
way.
That's
that's
good!
So
that's
that's
something!
Anyone
can
actually
borrow
from
this
repo.
If
you
happen
to
want
it
for
something
you're
doing.
D
I have a general question for you. Say there's an organization that thinks this is looking pretty cool and would like to try it out. What caveats would you have for them in terms of production readiness? Like: don't try it on clusters this big, don't try this configuration, or is it all great, just go for it?
C
I'd
say:
don't
don't
try
it,
so
the
the
size
of
the
cluster
is
probably
going
to
be
less
than
issue
than
the
cardinality
of
what
you're
exporting
in
terms
of
what
you're
going
to
blow
up
right.
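One quick way to gauge that before pointing the collector at a shared Prometheus is to count the series it adds. The `job` label value below is an assumption about your scrape config; substitute your own:

```promql
# Total number of time series scraped from the collector
count({job="usage-metrics-collector"})

# The heaviest metric names, by series count
topk(10, count by (__name__) ({job="usage-metrics-collector"}))
```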
C
Maybe
maybe
listing
like
this
thing's
going
to
list
all
your
pods
make
sure
you're
comfortable
with
that
sort
of
thing.
Obviously,
any
new
piece
of
software
don't
Don't,
just
run
it.
You
know
on
your
most
important
production
cluster
without
trying
it
like,
since
the
normal
sanity
applies.
I'd
say
be
careful
about
what
which
Prometheus
you
push
this
into
push
it
into
its
own
Prometheus
I'd,
say
even
not
just
starting
out.
You
probably
want
your
own
Prometheus
for
this
stuff.
You
don't
want
to
be
alerting.
C
You
don't
want
the
thing
that
tells
you
when
something's
terribly
wrong
and
alerts
you
to
be
also
the
fire
hose
for
your
utilization
metrics,
so
so
try
and
create
your
own
Prometheus
message.
Instance
for
is
probably
the
most
important
one.
F
I
would
I
would
probably
add,
add
to
that
like
run
it
side
by
side
with
whatever
you're
doing
now
and
like
carefully
examine
the
data,
because
you
might
be
surprised
once
you
start
getting
like
very
high
resolution
data.
What
what
you
may
not
have
seen
before.
C
Yeah,
that's
very
true:
the
yeah,
the
sampling.
It
may
look
a
little.
You
don't
want
it
to
look
wildly
different.
C
So
do
a
sanity
check
on
that
any
any
metrics
that
you're
going
to
make
decisions
off
of
I'd
I'd,
compare
to
see
advisor
and
just
do
the
like
you're
not
going
to
need
to
do
all
the
joins
with
the
advisor
just
to
get
an
estimate.
Like
sum
up
by
names
like
you,
can
do
it
at
a
namespace
level,
for
instance,
without
doing
any
joints
from
C
advisor
and
then
compare
like
the
namespace
level.
Metrics
make
sure
they're
the
same.
Maybe
do
some
individual
pod
metrics
kick
the
tires.
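As a concrete sketch of that cross-check, the cAdvisor side can be computed with no joins at namespace granularity. The collector-side metric name below is a placeholder, since the exact names depend on how the aggregation is configured:

```promql
# Namespace-level CPU usage from cAdvisor (no joins needed at this granularity)
sum by (namespace) (rate(container_cpu_usage_seconds_total{container!=""}[5m]))

# Compare against the collector's namespace-level aggregate
# (placeholder metric name; substitute what your deployment exports)
sum by (namespace) (collector_namespace_cpu_usage_placeholder)
```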
C
One
thing
you're
almost
certainly
going
to
have
to
do
is
the
if
you're
not
using
containerdy,
if
you're
using
the
c
groups
file
path,
walking,
it
relies
on
having
certain
naming
conventions
right.
It
reads
the
Pod
uid
from
the
from
the
c
groups
path.
It
reads:
the
container
ID
from
the
c
groups
path
and
different
different
systems
are
set
up
differently
with
you
know
some
of
them
append
everything
with
DOT
slice.
Some
of
them
are
pre-pinned
of
you
know:
coup
pods
Dash.
First
of
all,
Dash
whatever
to
the
Pod
name.
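To make that concrete, here is a rough sketch (not the collector's actual parser) of recovering a pod UID from a cgroup directory name under the two common layouts: plain cgroupfs names like `pod<uid>`, and systemd-driver names where the UID's dashes become underscores and the directory carries a `.slice` suffix:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// podUIDFromCgroupDir tries to recover a pod UID from one cgroup
// directory name. Illustrative only: it strips an optional ".slice"
// suffix, takes everything after the last "pod" marker, and converts
// systemd-style underscores back into the UUID's dashes.
func podUIDFromCgroupDir(dir string) (string, bool) {
	dir = strings.TrimSuffix(dir, ".slice")
	i := strings.LastIndex(dir, "pod")
	if i < 0 {
		return "", false
	}
	uid := strings.ReplaceAll(dir[i+len("pod"):], "_", "-")
	// A pod UID is a standard 8-4-4-4-12 UUID.
	ok, _ := regexp.MatchString(`^[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}$`, uid)
	return uid, ok
}

func main() {
	// cgroupfs driver layout
	fmt.Println(podUIDFromCgroupDir("pod8f2f53f4-31ae-4f44-8bd2-dbb2a49f04a2"))
	// systemd driver layout
	fmt.Println(podUIDFromCgroupDir("kubepods-besteffort-pod8f2f53f4_31ae_4f44_8bd2_dbb2a49f04a2.slice"))
}
```

A real implementation also has to extract the container ID from the leaf directory, which varies by runtime as well; this only shows why the naming convention matters.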
C
So
if
you're,
if
it's
not
working
out
of
the
box,
you're
gonna
have
to
you,
you
you
might
even
either
either
have
to
understand
what
it's
doing
or
maybe
you
could
ping
us
on
slack
and
say:
hey
I
I'm,
not
getting
utilization.
Metrics
I
looked
at
the
c
groups.
This
is
what
I'm
seeing.
B
The
container
energy-based
stuff
should
avoid
that
problem,
so
you
could
try
that
out,
but
oh
I
guess
another
possible
thing
that
we
could
have
on
the
road
map
is
support
for
more
Cris,
a
sort
of
a
general
thing.
B
I
wanted
to
use
like
a
general
library
and
one
didn't
exist,
so
them
to
the
breaks,
but
that
might
help
out
too.
We
have
one
minute
left.
So
I
guess
any
last
questions.
B
If that sounds good to people, we can just include it as a regular agenda item in our Thursday meetings, an update from the subproject, but I imagine we can work pretty asynchronously. Does anybody have any particular feelings one way or the other, now that we've demoed things and done the deep dive?
B
I
think
we
can
always
have
another
ad
hoc
sub-project
meeting.
If
we
need
one
I,
don't
know
if
we'll
need
regular.
What
at
this
point
so
I
think
it's
totally
fine
to
just
use
the
main
Sig
instrumentation
meeting
for
that.