From YouTube: Ceph Days NYC 2023: Why We Built a "Message-Driven Telemetry System at Scale" for Our Ceph Clusters
Description
Presented by: Nathan Hoad | Bloomberg
Ceph’s Prometheus module provides performance counter metrics via the ceph-mgr component. While this works well for smaller installations, it can be problematic to put metric workloads into ceph-mgr at scale. Ceph is just one component of our internal S3 product. We also need to gather telemetry data about space, objects per bucket, buckets per tenancy, etc., as well as telemetry from a software-defined distributed quality of service (QoS) system which is not natively supported by Ceph.
https://ceph.io/en/community/events/2023/ceph-days-nyc/
Hi everyone, I'm Nathan Hoad. As was alluded to by Mike, I'm a senior software engineer here at Bloomberg. I am on the distributed storage team, and I've been a part of the company for about seven years now.
Let's move on. So, a quick agenda for today: I'll talk a little bit about who we are as a company, some more about the storage group, and things like that.
Some information about our Ceph clusters, to give context for what I'm talking about. I'll talk a little bit about what's built into Ceph for telemetry, for those who don't know, why that telemetry doesn't work for us, and, as a result, what we wanted out of our telemetry system.
What our solution actually looks like; the results of this, i.e. how well does the solution match up to our requirements; and potential plans for the future. I will be taking questions at the end. So, firstly, Bloomberg: we're a financial technology firm. We were founded in 1981.
If you have not heard of us, our flagship product is called the Bloomberg Terminal. Our users, clients, whatever you'd like to call them, use the Terminal every day for data services, news, analytics, all sorts of things for financial analysis, essentially. We have over 350,000 subscribers all over the world. We process hundreds of billions of pieces of data every single day, and we do this with a force of over 7,000 engineers.
The Storage Engineering Group: we are responsible for designing, building, and maintaining all of the storage used by Bloomberg engineering. We have three main pillars: file, block, and object. We have teams for data protection, you know, like automated backups and things like that, for hardware failure and whatnot. We have storage workflows: say, for example, you have NFS mounts and you want your permissioning to be set up correctly, and things like that. You don't want a human to do that; it should be done by a machine.
Similarly, we have an automation team that works on automating things, not a big surprise there, for things like, you know, you want to increase your quota for your tenancy, or you want to create new tenancies, and things like that. Basically, letting our users onboard themselves and maintain their systems themselves, rather than us having to do it.
The distributed storage engineering team: so, as you might imagine, we have a focus on distributed storage. We are part of the Storage Engineering Group. Our focus is on software-defined storage, with our primary offering being object-level storage, i.e. the S3 API. Internally, we call this Bloomberg Cloud Storage, or BCS, and this shouldn't really surprise
anyone, given why we're here today: it is backed by Ceph. So, some information about our Ceph clusters. Given that we are very heavily object-focused, it shouldn't come as a surprise to anyone that we are heavily RADOS Gateway focused. As our great friend Matt alluded to earlier today, we do over a billion S3 requests every single day; that is billion with a B. And of Ceph clusters, we have a total of four.
So, what telemetry does Ceph offer by default? By default, it integrates with an open source product called Prometheus. This is a popular monitoring system and time series database, I should say. The way that it works is it receives metrics by scraping what it calls an exporter. An exporter is just an HTTP server with a metrics endpoint; that metrics endpoint returns a tabulated plain text list of metric names and their values.
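For illustration, a scrape of such a metrics endpoint returns plain text along these lines (the metric names below are representative of what the ceph-mgr Prometheus module exposes, not an exact dump from any cluster):

```
# HELP ceph_osd_up OSD status up
# TYPE ceph_osd_up gauge
ceph_osd_up{ceph_daemon="osd.0"} 1.0
ceph_osd_up{ceph_daemon="osd.1"} 1.0
# HELP ceph_cluster_total_used_bytes DF total_used_bytes
# TYPE ceph_cluster_total_used_bytes gauge
ceph_cluster_total_used_bytes 8.09280733184e+12
```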
Given this, you can think of this as a pull-based system rather than push-based, i.e. Prometheus has to be aware of where to get the metrics from, not the other way around. For visualization, Grafana is a really popular choice. You can use Prometheus's built-in visualization, but the common consensus, from what I can gather, is that generally Prometheus's querying and visualizations are used more for debugging rather than actual, like, user-facing visualizations. Grafana is significantly more feature-rich, and it's a fun thing.
If you do have a cephadm-based deployment, you get these things for free, out of the box: Prometheus and Grafana containers will come along. You also get the node exporter container, which does things like CPU usage and things like that. The Grafana container also comes with a bunch of built-in dashboards and everything, so it's extremely plug and play. And then, finally, you have the Ceph dashboard, or Ceph web interface; this also provides you a nice overview of the cluster.
So why doesn't this work for us? It sounds pretty great, but: number one, the documentation already notes that the plugin will be slow when you have over a thousand OSDs. From earlier, I said we have four and a half thousand in each of our clusters, so that ship is well and truly sailed. Next is another scalability issue.
So from this, we can come up with a list of our requirements. Number one: it should be as real-time as possible, for both Ceph-level and RADOS Gateway-level metrics. We want to know things like cluster throughput, PG status, you know, bucket count, garbage collection, all of that sort of thing, with the idea being that the more real-time it is, the more responsive you can be to potentially arising issues.
Scalability is obviously an important thing. We want it to be an intrinsic part of the system that you can scale tasks up and down as necessary. You shouldn't have to put a lot of work into breaking up your tasks so that you get fair distribution over your system, and, of course, that means that it should be a distributed system as well, and fault tolerance should come along for the ride. Thirdly, it should publish into Grafana. We have a major installation of Grafana here at Bloomberg.
Basically, any and all metrics will go into it. This means that, to give our users the best and most expected experience, we should be publishing into Grafana, because that's where they would expect to look. And fourthly, it should be easy to extend. If you've ever run ceph or radosgw-admin help, you'll see there's like a million subcommands, and you can get basically anything you can imagine out of the system.
Thirdly, it's written in Python. This lends itself to being very easy to extend; I'll have some example slides later that show some code demonstrating this. And fourthly, the technology choices that we've made make it really easy for us to scale up and down; more on that later. So, the pub/sub model: you have three major components. You have a publisher, a message broker, and a subscriber. As you might imagine from the name, the publisher is responsible for feeding data into the system.
The message broker is responsible for receiving that, storing it as appropriate, and then sending it out to the appropriate subscribers. And the subscribers are responsible for receiving this information, processing it with whatever application-specific logic you have, and then notifying the message broker that the processing is complete and it can be removed from the queue. This is nice because you can independently scale any of these individual components, depending on what your needs are. So say, for example, your subscribers are scraping websites and you want to be able to scrape more websites at a time.
You can either optimize this case individually, so that you're processing them faster, or you can just run more processes, so it makes it quite easy. Also, in terms of fault tolerance, this is quite nice, because say, for example, this topmost subscriber processing message one crashes. What will happen is the message broker will notice that the connection has been lost and message one has not been successfully processed, and it will redirect it to the next available subscriber.
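That acknowledge-or-redeliver behavior lives in the broker; a minimal sketch of the subscriber side, using the pika client against a hypothetical "work" queue, looks like this:

```python
import pika

def process(body: bytes) -> None:
    ...  # hypothetical application-specific logic

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="work")

def handle(ch, method, properties, body):
    process(body)
    # Only acknowledge after processing succeeds; if the subscriber
    # crashes first, the broker notices the lost connection and
    # redelivers the message to the next available subscriber.
    ch.basic_ack(delivery_tag=method.delivery_tag)

channel.basic_consume(queue="work", on_message_callback=handle)
channel.start_consuming()
```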
So, that's a high-level overview of pub/sub. So, more specifics on our tech stack. For our message broker, we went with RabbitMQ. It's highly flexible; you can do a lot of different things with it, depending on what your needs are, and we have a dedicated team at Bloomberg providing support for this, so we don't have to go through with the maintenance burden of owning yet another system.
Secondly, we have Celery. If you have not heard of Celery, it is a Python framework for implementing a task scheduler. The way I describe this to people is that it is good for turning code into messages. When you use vanilla RabbitMQ, you have to care about the serialization formats of your messages as you feed them into the broker. This, for the most part, takes that away from you. And thirdly, as I have already alluded to, Grafana. Grafana is important for us, and we have a dedicated team providing support for that as well.
More information about RabbitMQ: it has client libraries in every major programming language. This speaks to its popularity; there's a lot of existing knowledge out there. If you have a question, someone else has probably already asked that question before, and someone's probably already answered it.
The way that it implements pub/sub is: publishers write to an exchange, and subscribers consume from a routing key, and the way that you join these two things together is dependent on the type of exchange. So, for example, the most common would be a direct exchange; that's a one-to-one mapping of one exchange to one routing key. Then you have fanout, which is very similar, except it's more of a broadcast system, so you can do one exchange to many routing keys. And finally, you have topic-based; so, say you had an exchange called news.
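As a rough sketch of those three exchange types in code (again with the pika client; all exchange, queue, and routing-key names here are made up for illustration):

```python
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()

# Direct exchange: one-to-one, a message goes to the queue whose
# binding key matches the routing key exactly.
channel.exchange_declare(exchange="tasks", exchange_type="direct")

# Fanout exchange: broadcast, one exchange to many routing keys;
# every bound queue gets a copy of the message.
channel.exchange_declare(exchange="announcements", exchange_type="fanout")

# Topic exchange: routing keys are dot-separated and bindings can use
# wildcards, e.g. "news.*" matches "news.sports" and "news.tech".
channel.exchange_declare(exchange="news", exchange_type="topic")
channel.queue_declare(queue="sports")
channel.queue_bind(queue="sports", exchange="news", routing_key="news.sports")
channel.basic_publish(exchange="news", routing_key="news.sports", body=b"scores")

connection.close()
```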
This is where our scalability comes in. RabbitMQ doesn't really care how many subscribers you have connected, so, as a result, to get vertical scalability you just run more subscribers on the hardware that you have, and to scale horizontally you basically do the same thing, but with adding more hardware: you're just running more subscribers on a different machine. And finally, it's boring technology. What I mean by this is: it has been so battle-tested, and it's been around for so long now, that it's a safe choice. It's no longer exciting.
It's like picking Microsoft Word or Postgres; it's a good thing, like, safe is good. Celery: as I alluded to, tasks are just functions, like they're just literal code. You get to forget about how you serialize messages, which is nice. It also decouples you from your message broker: if you decide, for whatever reason, that RabbitMQ is not for you, you can move to Redis by changing a single line of configuration. And, being a task scheduler, it has all of the regular primitives
you would expect from that sort of system, like you can group functions, delay them, add callbacks, all that sort of stuff. And one thing that's particularly interesting is that you can create different categories of subscribers by specifying which queues they should publish and subscribe to. So say, for example, you wanted to do some graphics processing on a cluster of machines, and you only had a subset of that cluster that had dedicated graphics processing hardware.
You could create subscribers on only those machines that only consume those messages, so that you're making sure messages are routed as appropriate. This is something that you can do with vanilla RabbitMQ, but it's significantly more manual; Celery makes this quite nice. And again, it's boring technology. Boring is safe, and safe is good.
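In Celery, that kind of routing is just configuration; a minimal sketch, with hypothetical task and queue names:

```python
from celery import Celery

app = Celery("tasks", broker="amqp://localhost")

# Route the GPU-bound task to its own queue; everything else keeps
# the default "celery" queue.
app.conf.task_routes = {"tasks.render_frame": {"queue": "gpu"}}

@app.task
def render_frame(frame_id: int) -> None:
    ...  # hypothetical graphics-processing work

# Then, only on the machines with graphics hardware, start workers
# that consume just that queue:
#
#   celery -A tasks worker -Q gpu
```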
So, a piece of code. If we look at this, this is probably the smallest example of a Celery application I could come up with. At the very top here, we have an instantiation of a Celery object.
This acts as both a task registry and a place to configure your application, for like queue names and timeouts and things like that. And further on down, we have two functions, the first of which is not super interesting: it is called bar, it receives an integer, and it logs that value. What makes this into a task is the task decorator.
So, if you've used Python before, you'll know that bar.delay, like, the delay method, doesn't normally exist on a function like that in Python. This is something that the task decorator will add for you.
This is nice because, under the hood, what this is doing is it's serializing a message saying: we would like to asynchronously call the bar function with the argument 55. It will pass that off to RabbitMQ, where it gets consumed by a worker and then processed appropriately.
It's good because it feels quite natural: you don't have to write a lot of plumbing, like manually serializing this thing; it's a nice experience. And then, secondly, we are calling bar as a regular function; they still work as regular functions.
So this is nice for a couple of reasons. One, that means you can progressively enhance the system: as you find you have certain tasks that you would like to break out and make into independent asynchronous tasks, you can do that while still maintaining the current behavior of letting them run synchronously. And then, finally, at the end, we have a small example of using group and signature to kick off a bunch of tasks all at once and wait on the result of all of them.
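The slide itself isn't captured in the transcript, but a minimal Celery application along the lines described might look like this (the broker URL, the second function, and the argument values are illustrative):

```python
import logging

from celery import Celery, group

# Instantiating the Celery object: it acts as both a task registry and
# a place to configure the application (queues, timeouts, broker, ...).
app = Celery("example", broker="amqp://localhost", backend="rpc://")

@app.task
def bar(value: int) -> None:
    # Not super interesting: receive an integer and log it. The task
    # decorator is what turns this plain function into a Celery task.
    logging.info("bar received %d", value)

@app.task
def square(value: int) -> int:
    return value * value

if __name__ == "__main__":
    # .delay doesn't normally exist on a Python function; the task
    # decorator adds it. It serializes "call bar(55) asynchronously"
    # and hands the message to the broker for a worker to consume.
    bar.delay(55)

    # Tasks still work as regular, synchronous functions too.
    bar(55)

    # group + signatures: kick off a batch of tasks all at once and
    # wait on the result of all of them (needs a result backend).
    results = group(square.s(i) for i in range(10))().get()
    logging.info("squares: %s", results)
```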
So how do you actually run this? The Celery library provides a command line tool called celery. You basically use this for configuring the application you want to run, the amount of concurrency, you know, your concurrency model, things like that, logging, all that sort of stuff. So the most useful thing in the top line is the concurrency flag; you can see we have it set to one there. For the default model, that will mean that you are running a single process, so you can process one task at a time.
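The exact invocation isn't captured in the transcript, but a worker command of that shape would look something like this (assuming the module name from the sketch above):

```
celery -A example worker --concurrency=1 --loglevel=INFO
```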
We did not change the queue settings in the previous slide, so they are the default: a queue called celery, which is made up of an exchange called celery, which is a direct exchange, and a routing key also called celery. Fun stuff. And then we have our list of tasks which, as you would expect, matches the list of functions that we had in the previous slide. And Grafana: I'm not going to talk a lot about this; we've been talking about Grafana a lot today. It's a highly personal experience for what you need.
Another nice thing to note is that it impresses non-techie people. Every now and then, my wife will see me working on one of these dashboards and she'll say to me: did you make this? And, I mean, yeah, I did. I'm not a UI guy, but I definitely made the dashboard, right? So that's a huge positive.
So, if we take all of these things and we piece them together, what does it look like? We start in the very top left there, with a task scheduler. This is a very fancy way of saying: a script that we run via cron. This is responsible for generating the list of tasks that we would like to process; so say, for example, ceph status.
It will get sent to RabbitMQ. In this state, I refer to the tasks as being unenriched. What I mean by that is, as I alluded to earlier, we have multiple data centers, and we have multiple zones within those data centers. We want to delegate this task out to each of those data centers and zones as appropriate. Another important factor is that, for the RADOS Gateway tasks that we want to run, we have ownership metadata associated with the tenancies that maps to internal IDs for our systems, for like routing tickets.
So what happens with these unenriched tasks is: they go from RabbitMQ, they are consumed by the scheduler subscribers down the bottom there, they perform the enrichment that I was talking about, and they submit it back to RabbitMQ. And then where they should go is dependent on, like, the parameters, right: the data center, the zone,
and what the actual task is. So, continuing along with ceph status, we can see that that will go down to the Ceph mon subscribers, and then RADOS Gateway-level commands will go to the RADOS Gateway subscribers, which run on the RADOS Gateway nodes. The basic rule here is: if it's a ceph command, it goes to the mon nodes; if it's a radosgw-admin command, it goes to the RADOS Gateway nodes. And then these tasks on these machines are responsible for running the relevant ceph command,
collecting the output, parsing it, transforming it as necessary, and then publishing it into Grafana. A more concrete example of what I'm talking about: so, at the very top here, we can see we are now configuring our list of queues that we want to consume on. We have the scheduler queue, which will be picked up by the scheduler nodes in the previous slide, and then we have extra queues for each data center.
We then have an entry point of start_bucket_stats. So this is a task that will be published to the scheduler queue. It is responsible for collecting the tenants and tenancy information that I mentioned, and then, for each of our data centers, it will call start_user_bucket_stats.apply_async. apply_async is another method that the task decorator will add to your functions. It's basically the same as .delay, with the main difference being that it gives you more control over the actual queuing mechanism that you use.
So you can see I'm using it here to specify which queue it should be published to for the relevant data center. You can also use it for timeouts, retries, all of that sort of thing. And then, if we go down to the actual implementation of start_user_bucket_stats, you can see it's fairly straightforward: we're doing a RADOS Gateway user list, and then, for each user, we're generating a task to collect the bucket stats for that given user, and then, further on down, the implementation of the actual bucket stats collection.
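The slide code isn't reproduced in the transcript, but a sketch of the flow as described, with hypothetical queue names and helpers, might look like:

```python
import json
import subprocess

from celery import Celery
from kombu import Queue

app = Celery("telemetry", broker="amqp://localhost")

# The scheduler queue plus one queue per data center (names made up).
DATA_CENTERS = ["dc1", "dc2"]
app.conf.task_queues = [Queue("scheduler")] + [Queue(dc) for dc in DATA_CENTERS]

def radosgw_admin(*args: str):
    # Run a radosgw-admin subcommand and parse its JSON output.
    return json.loads(subprocess.check_output(("radosgw-admin",) + args))

@app.task
def start_bucket_stats() -> None:
    # Entry point, published to the scheduler queue by cron; fans the
    # work out to each data center's own queue via apply_async.
    for dc in DATA_CENTERS:
        start_user_bucket_stats.apply_async(queue=dc)

@app.task
def start_user_bucket_stats() -> None:
    # Enumerate users, then generate one task per user to collect
    # that user's bucket stats.
    for user in radosgw_admin("user", "list"):
        collect_bucket_stats.apply_async(args=[user])

@app.task
def collect_bucket_stats(user: str) -> None:
    stats = radosgw_admin("bucket", "stats", "--uid=" + user)
    publish_to_grafana(stats)

def publish_to_grafana(stats) -> None:
    ...  # hypothetical: transform and push into the time series backend
```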
This is quite nice, because it's clearly a highly flexible system. You can change this: say, for example, you decided that collecting these metrics at a user level is not fast enough, and you want to collect them at a bucket level. Then you could do a bucket list on your user and then collect the individual bucket stats, and when we have like 50,000 buckets, that would generate 50,000 tasks.
The limit here is not that we're going to run 50,000 tasks at once; that's determined by the concurrency we had earlier. So this is nice, because it gives a sense of the scale, and breaking your tasks up into as small a task as possible will give you the best concurrency and, like, the best responsiveness in the system.
So, the results of this: as I said, it's very responsive. We can collect metrics in under five minutes. This means that users see quota changes effectively immediately, which is obviously nice for them. They can see, you know: oh, I was at 80 percent, and I deleted some stuff, and now I can see I'm down at 50 percent; I'm out of the woods. Load is distributed really well. This is significantly better resource utilization: we pay a lot of money for the hardware that we have; using it adequately for this task is important.
There's a lot of built-in retries and timeouts. This is something that both Celery and RabbitMQ give you. You don't have to write a lot of code yourself, like "for i in range(10): try this thing"; you just say retries equals five and you're good to go. And it scales easily as the system grows: as we saw, you just increase a number on a command line and you're off to the races. It's easy to add new metrics: it's just regular Python code; you're calling a process and parsing the output.
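In Celery terms, that "retries equals five" convenience looks roughly like this (the task body and names are illustrative, not the actual code):

```python
import subprocess

from celery import Celery

app = Celery("telemetry", broker="amqp://localhost")

@app.task(autoretry_for=(subprocess.CalledProcessError,),
          retry_kwargs={"max_retries": 5})
def collect_status() -> bytes:
    # On failure, Celery re-queues the task automatically, up to five
    # times, instead of us writing a manual "for i in range(10)" loop.
    return subprocess.check_output(["ceph", "status", "--format=json"])
```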
So it's quite nice: you don't have to put a lot of work into the metrics once you know what command to run. And this integrates with existing metric-based alerting. As I said, we have a lot of pre-existing systems here at Bloomberg; our users are very accustomed to that user experience. We don't want to write a whole brand new, you know, alerting system, so it's quite nice how this hooks in. Potential plans for the future: we could look at extending these metrics for RBD, the RADOS Block Device.
That sort of depends on our interest for it. We can also collect metrics on the metrics. So what I mean by this is: if you collect metrics on how often and how long it takes you to collect your specific named metrics, then you know where to focus your optimization efforts. So if you have a task you're running every five minutes, but it takes 15 minutes to complete...
Just from today, there's been a lot of talks about Ceph telemetry, so this feels like something people might find useful, but it sort of depends on all of you. And that is everything. We are hiring, so if you are interested in working on Ceph or any of the other cool stuff that we have here, please do come and talk to me. Because, as I said, that's everything; so, does anyone have any questions?
Audience member: [...] the op latencies, everything. But it becomes, like, in the old-school monitoring stack that we have, at a certain point, it's like you can't search, you can't find out. We want to find out which was the slowest disk over the last, you know, 24 hours, things like that. Are you able to, like, filter and get the top of, say, like, the bucket that had the most IOPS over the last day, or things like that?
Nathan: Totally. I mean, like, this sort of comes back to what I was saying about Grafana, right: it's heavily dependent on what you're using for your time series information, and so I can't really speak specifically about the time series database that we're using to back Grafana here. But, you know, I personally feel that, yes, it would be easy to get that information out; like, the querying model lets you do that.
Audience member: [inaudible]

Nathan: Easier to scale, yeah. That's actually a really important note: this is more focused on overall cluster-level metrics and, like, RADOS Gateway-level metrics, rather than doing, like, individual node-level metrics, like what you're talking about, for "what OSD is giving me trouble".
Audience member: I think there's a lot of interesting stuff here, especially the RabbitMQ integration and scaling, but I want to draw your attention to the upstream work going on that I think can possibly converge with some of this. There's a new thing called the Ceph exporter daemon. That's running per node, in containerized environments or deployed otherwise, that's designed to sort of vacuum up all the perf counter information from, let's say, the RGWs or other daemons as it grows. And then there are more types of metrics being added: there are extensions to the perf counters interface being created to make it more hierarchical, and, for example, RGWs are gaining the ability to sum up counters per bucket and per user, and those are flowing through that system.
So those would be interesting types of things to potentially have this thing talk to, and also the Ceph RGW admin interface, which could be widened so you don't have to launch radosgw-admin processes to scrape things. It doesn't look like it, but that's a pretty heavyweight operation that spawns up, essentially, a new node in the cluster, gets all that going, and then spits out your six JSON records. But anyway, very incredible.