From YouTube: Ceph Performance Meeting 2021-05-20
A
All right, let's see — actually, as far as I can tell, there are no new performance-related PRs this week. I could have missed something, but I didn't see anything. It's been a pretty slow week for new PRs, it looked like, when I scanned the new stuff.

The other one I noticed was that there's another kind of long-standing PR about auto-tuning — having the manager set the container memory limits automatically. So this is auto-tuning inside the OSD, or, you know, inside daemons in general, I guess — the idea being that you have the container's memory limit set automatically based on some fraction of something (I'll confess I don't know exactly how it works), and then our auto-tuning code would pick it up after that and figure out how to distribute that memory within a particular daemon.

Okay, so: updated Adam's do_write_small PR. I think we talked about that extensively last week. I think the idea now is that we're just going to make flags for when to do write-small, deferred, or direct IO, and then try to observe what behavior that actually results in. So that's still kind of an open-ended question — what it should be by default.

Oh, there's an RGW one here — this "OSD compression bypass after RGW compression" PR. DC, I think, if I remember, you were reviewing that, maybe even this week. Anything new?

Well, all right. The last updated PR that I saw — there's interest again in my cache pinning, er, the age-binning PR. So the idea here is that for all of our LRU caches in BlueStore, and potentially anything else — this is the priority cache manager — we would make it so that we can track the items in the different LRU caches and associate them with age bins, where one bin might be the last five seconds, another might be the last 10 seconds, 30 seconds, whatever. And the idea behind that, then, is that, on a very coarse-grained basis, you can look at what the relative ages of the different caches are.

The goal of this, actually, is very much inspired by RGW, where in certain circumstances you may really want to heavily cache omap data, in other circumstances you may want to very heavily cache onode data, and in others you might want to mix in both — and it might change over time. The idea would be that over time — as a gradual process; not really gradual, but somewhat gradual — you would migrate memory more towards onode or more towards omap caching, depending on what the current workload looks like and what the relative ages of those caches are. Anyway, Radek kind of did a quick review of it this week. I need to update that PR based on a branch that Adam and I were working on last fall, and we might actually try to get it in here. So anyway, that's that PR.
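A minimal sketch of the age-binning idea described above — the bin boundaries, names, and structure here are illustrative assumptions, not the actual PR:

```cpp
#include <array>
#include <chrono>
#include <cstddef>

// Illustrative sketch of age-binning for an LRU cache -- not the actual PR.
// Each cached item remembers when it was last touched; the cache can then
// report how many of its items fall into coarse age bins, so the priority
// cache manager can compare relative ages across different caches
// (onode vs omap vs anything else) and rebalance memory between them.
struct AgeBinnedLRU {
  using clock = std::chrono::steady_clock;

  // Example bin boundaries in seconds; anything older lands in a final bin.
  static constexpr std::array<int, 3> bin_bounds{5, 10, 30};

  struct Item {
    clock::time_point last_touch;
    // ... payload and intrusive LRU hooks would live here ...
  };

  static void touch(Item& item) { item.last_touch = clock::now(); }

  // 0 = touched within 5s, 1 = within 10s, 2 = within 30s, 3 = older.
  static std::size_t age_bin(const Item& item) {
    auto age = std::chrono::duration_cast<std::chrono::seconds>(
                   clock::now() - item.last_touch).count();
    for (std::size_t i = 0; i < bin_bounds.size(); ++i)
      if (age < bin_bounds[i]) return i;
    return bin_bounds.size();
  }
};
```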
A
That was all I had for this week. Did I miss anything from anyone?

C
Hey, I'd like to talk about it, but give me like 10 minutes or so.

A
Absolutely, sounds good, Adam.

Let's talk about sync behavior in the kernel, and buffered IO and direct IO, if you are ready.

D
...reads/writes to blocks, 4k to 64k — and, more strangely, it was not constant, and constantly lower performance.

First, if I left buffered IO for reads, but the write path was replaced with direct writes — just using aio writes — I got perfectly stable, high performance. That's one operation I did. And the second thing I did was applying this kind of patch, which we had for some time and then retracted — it broke up the BlueFS lock, just yielded the BlueFS lock for the time of writing. That one was dropped because of various reasons — it caused us problems — but for testing purposes...

This solved my performance problems, and it seems that the root cause is that tiny lock in BlueFS: when I'm trying to frequently write to the write-ahead log, and the other threads — RocksDB threads — are trying to write to new SST files that are just being created, they can interfere. If the write-ahead log wins, I just cannot see the difference — RocksDB works a bit slower. But when the write-ahead log write from the kv_sync_thread loses, then I get a huge performance drop. So the topic for me would be two things. First: is it okay that I start to make BlueFS more multi-threaded than it currently is? And the second: is it okay — and is it really sane — to use the same data, reading them in buffered mode but writing them using direct mode, and exactly in that direction? First I write them, and I don't read them during the write process (like when we create any file in BlueFS); but after I finish writing with direct mode, then I'm reading them with buffered mode. Will it have some problems, or is it a valid—

E
—path. I think the key thing is that if you ask a kernel developer about that, they'll say no, don't do it, it's terrible. But the reality is that in the direct IO path there's sort of a best-effort attempt to invalidate the page cache — so the direct write will blow away any pages that are in that page mapping.

And so, if you read it again afterwards, it'll work — basically, as long as there are no racing processes or threads that are also trying to read at the same time. Because — okay, so: if you were doing a direct write, and while you were doing that direct write somebody did a dd if= on that block device and got unlucky, they might read the page after it was invalidated, and it would still read the stale data. So you could have the page cache polluted with stale data.

E
So you should be okay as long as there isn't some external process that's writing — which we're just going to assume there isn't; like, BlueStore is the only thing writing to the device. The only thing is, you've got to make sure that an external reader doesn't screw you up. And I can't remember exactly — the direct IO code, like, does that invalidate thing; I can't remember if it does it before or after the write. But you've got to make sure that it's basically done after your write has hit the device. It doesn't have to be after the sync, but it has to be after the write has been acknowledged by the device.

E
That's — I mean, the code is the same. Like, you'll do your write call, or aio_write or whatever; as soon as the aio is acknowledged — that the write is completed — then you need to do an fadvise, or whatever it is: DONTNEED, something like that — I can't remember exactly what the arguments are — but that's the best-effort syscall that will basically throw away any clean pages in the page cache. Or you need to go look at the kernel code and make sure that the direct IO path does that same invalidation — invalidate mapping range, or whatever it is — after the write completes.
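To make that sequence concrete, here is a hedged sketch of the pattern being discussed — an O_DIRECT write followed, once the write is acknowledged, by an explicit best-effort page-cache invalidation. POSIX_FADV_DONTNEED is my reading of the "DONTNEED, something like that" above, and the function is illustrative, not BlueFS code:

```cpp
#include <fcntl.h>
#include <unistd.h>
#include <cassert>

// Sketch of the discussed pattern (illustrative, not BlueFS code): write with
// O_DIRECT, and only after the write has been acknowledged by the device,
// drop any possibly-stale clean pages for that range, so that later buffered
// reads are forced to fetch fresh data from the device.
void direct_write_then_invalidate(int fd, const void* buf, size_t len, off_t off) {
  // Assumes fd was opened with O_DIRECT and that buf/len/off meet the
  // alignment requirements of direct IO.
  ssize_t r = ::pwrite(fd, buf, len, off);
  assert(r == static_cast<ssize_t>(len));

  // Best-effort invalidation of clean pages in [off, off + len). The kernel's
  // direct-IO path attempts this itself, but doing it after the write is
  // acknowledged guards against a racing buffered reader having re-populated
  // the page cache with stale data in between.
  ::posix_fadvise(fd, off, len, POSIX_FADV_DONTNEED);
}
```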
E
Okay, actually, you know what — here's the other thing, though. If you're doing this — if you're doing a direct write and then using buffered reads — you're always going to read your writes back from the device, the first time at least, before they get into the page cache; and then a subsequent read will work. So is that really what you want? Is the goal of this to not cache writes unless they're read?

D
It kind of is, because I don't really expect RocksDB to read my data right after I'm writing them — because I'm creating a new SST file, and I don't really expect it right away. Okay — if it's the write-ahead log, it will definitely not need it; and as for an SST file, it will take some time before it will be required. Yeah. Okay, maybe.
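As an aside on the first question (making BlueFS more multi-threaded), the contention being described looks roughly like the following — a purely illustrative sketch with hypothetical names, not the real BlueFS code:

```cpp
#include <mutex>

// Illustration only -- hypothetical names, not the real BlueFS code. With one
// global lock, frequent small write-ahead-log appends from the kv_sync_thread
// serialize against RocksDB background threads writing brand-new SST files:
struct BlueFSGlobalLockSketch {
  std::mutex lock;  // everyone queues here
  void append_wal() { std::scoped_lock l(lock); /* tiny, frequent append */ }
  void write_sst()  { std::scoped_lock l(lock); /* large, bursty write   */ }
};

// "More multi-threaded" essentially means narrowing that to per-file locks,
// so the WAL append never has to wait behind an unrelated SST write:
struct BlueFSPerFileLockSketch {
  struct File { std::mutex lock; };
  void append(File& f) { std::scoped_lock l(f.lock); /* touches only f */ }
};
```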
A
Adam, I have to apologize — I wanted to do that last week, and then I got sucked into lots of other projects, so I haven't actually looked at your code yet. Sorry.

All right, Josh, are you ready yet? We also have other stuff we can talk about, if not.

C
Yeah, yeah, I can go into the telemetry stuff a little bit right now. So, essentially, what I'd like to talk about today is to think about what kinds of performance information it would be useful to collect from telemetry, compare that with what we have today in perf counters, and see what's missing and how we could add it. We have an intern starting next week to help out with this, but I just want to get some of the ideas out there earlier rather than later.

So my one thought is to collect a little more information about distributions of IO sizes and types of IO — to try to get a better idea of what kinds of workloads we see in aggregate.
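For instance, an IO-size distribution can be kept cheaply with power-of-two buckets — a sketch of the general idea, not an existing Ceph counter:

```cpp
#include <array>
#include <cstdint>

// Sketch of a coarse IO-size distribution (illustrative, not an existing
// Ceph counter): bucket i counts IOs of size [2^i, 2^(i+1)) bytes, so the
// whole distribution is 32 integers -- cheap to update and tiny to report.
struct IOSizeHistogram {
  std::array<uint64_t, 32> buckets{};

  void record(uint64_t bytes) {
    unsigned b = 0;
    while (bytes > 1 && b < buckets.size() - 1) { bytes >>= 1; ++b; }
    ++buckets[b];  // e.g. a 4 KiB IO lands in bucket 12
  }
};
```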
C
Well — thanks; in the chat you've added exactly the thing I was thinking of, the geometrics.

It would also be interesting to see, perhaps, how this affects the internal state of the OSD and BlueStore — the different kinds of data structures, maybe what different kinds of patterns we see in load sizes, or the numbers of key-values that we see. We've sometimes in the past had these questions we wanted to answer about: what does an RBD data set look like, what does an RGW data set...?

Yeah, that's true, although I think it wouldn't hurt to collect both — like, all kinds of metrics in all cases. It's not like me to be too concerned about turning these off. We do need to be mindful that with telemetry we're kind of gathering these in aggregate — my entire cluster, at most, usually, like, once a...

Another major area that can lend itself to that as well is memory usage. We often have questions about how much memory, like, the PG log is taking, or what we are using for caching versus block caching versus other pieces.

That could be pretty interesting — to see if there are different kinds of patterns or trends that we would otherwise miss. The other question that comes to mind with this is: how much is the aggregation going to...? Is the aggregation still going to be useful for us? Like, we're talking about collapsing an entire cluster's OSDs and adding them up — adding all the perf counters across the entire cluster.

F
Even for the memory information — like, you know, at what cadence do we capture that? Like, telemetry is doing this once a day — is that enough, or do we want to think about something else in that direction, I think?

A
You're asking, right — if you're looking at, like, a data set over a year, right, then one day is not a bad sampling rate; you'll kind of see, gradually...

Okay — we generally know that the cluster is, you know, doing RGW, and they're doing this IO size typically, and their workload doesn't change, they're just doing the same thing — fine. But if you see that their workload is changing rapidly — some days they're doing this, some days they're doing that; someone's coming in with a really small IO-size workload, and now, all of a sudden, someone else is doing huge ones — that kind of granularity is probably not going to be great at capturing those details.

C
You could do some kind of, like, summarization over the course of a day — like, even if you're only sending the data once a day, you could also track, like, what's the max memory usage for each kind of memory over that day, maybe even bucket it, so you can kind of say: hey, for this day, we've used this much memory for this many hours, roughly.

G
There are also still workloads in data centers which run daily on different schedules — like nightly batch processes and things like that — and usually they run when there is less load from other workloads; that's the way to better utilize the data center. So if you do it on a daily basis, you could get some kind of averages which are not related to workloads which run at completely different times. So I agree with Mark — hourly could give more information, which could sometimes be really interesting information about this.

In a past life I was doing billing for telecommunication service providers, and there were batch processes with huge amounts of IOs that ran only nightly.

E
I think we should just be cautious that we're only collecting data that we think we're actually going to make use of. For example, if you're collecting performance data on an hourly basis, that will basically tell you what the daily load cycle is on that cluster — that might be identifying information; it might let you figure out what that cluster is and what's going on. And are we really gonna — I mean, do we really need, like, these 24 data points over the course of a day?

Right, yeah — I mean, if we just want to make sure that we're not, like, always sampling it right when the telemetry gets sent out, right when there's a batch job going on, then, like, you know, sample it three or four times a day or something like that — which probably isn't enough to, like, figure out what time zone the cluster's in, but...

C
We could also remove the time component and just build a distribution of the — say, memory usage over the course of the day — that doesn't reference the time at all.
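A sketch of how those two ideas might combine — a per-day max plus a time-free distribution of samples over size buckets; the category names and bucket boundaries are made up for illustration:

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>

// Sample each memory category periodically, but report per day only the max
// and a time-free distribution of samples over size buckets. "8 GiB for two
// hours each night" still shows up as a small count in the top bucket,
// without any time-of-day information leaving the cluster.
struct DailyMemorySummary {
  static constexpr std::array<uint64_t, 5> bucket_gib{1, 2, 4, 8, 16};

  struct Category {
    std::array<uint64_t, 6> counts{};  // one extra bucket for > 16 GiB
    uint64_t max_bytes = 0;
  };
  std::map<std::string, Category> categories;  // e.g. "pg_log", "onode_cache"

  void sample(const std::string& name, uint64_t bytes) {
    auto& c = categories[name];
    c.max_bytes = std::max(c.max_bytes, bytes);
    std::size_t i = 0;
    while (i < bucket_gib.size() && bytes > (bucket_gib[i] << 30)) ++i;
    ++c.counts[i];
  }
};
```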
G
The thing is that sometimes it's workloads of completely different characteristics — so it could be way larger IOs, or smaller IOs, or whatever. So the question is also not the number of data points, but what you're getting for each data point — averaging two different workloads could lead to something a bit bizarre.

F
Yeah, that's a good point, Mark. I think the question I have in my mind is: what are we hoping to learn — or, like, what is that one data point that we don't have, that we want to get out of this telemetry information? Let's say we start capturing all this information at a daily cadence, or whatever — what will we be, you know, looking for?

A
The question that I used to ask Neil all the time — and he hated it, because, you know, he didn't have the data, and I kept pestering him — was: what do people actually do with their clusters?

G
I have one use case — I'm not sure whether it fits exactly this, but it's a real use case that I try to work on with different things. We always have — we try to solve the issue of bin-packing pods onto nodes in Kubernetes, so we could use fewer resources.

We know from statistics that on public clouds we have something like 42% of idle resources that customers pay for. We want to reduce this number. So today we ask Kubernetes for a specific amount of memory per user and it does the scheduling.

But if I know that the max memory is a specific value, and I reach this value — eight gig — every day, that's good, but it's not good enough. Because if I hit it only for two hours during the night, and during these two hours I have other pods which do not use their maximum values, then I could do more efficient bin-packing. So understanding more about the max usage — a single number for max resource usage leaves us in a very safe position, but it could be very expensive for the customer.

So if we want to be able to do bin-packing of pods onto nodes, we need more data than just max values — we need some kind of finer granularity. I don't know if this is the exact use case for this thing, but it's a real use case that is needed somewhere, and we are working on issues in this area — we need to do some kind of matching of resource usage with all the workloads, not only Ceph's, but all the workloads together, to try to save money for the customer.

J
One more thing about this: I don't think it's necessary to have the customers send the data all the time. The customer can aggregate the data locally and then — I don't know — once a month send us a summary. We don't need the raw data; by running our code on the customer side, we could work with aggregated data.

Compress it, do some calculation, and then send us, once a month, some information about what they're using, what kind of installation they have. I had some questions in the past: what's the average, or what's the percentile — what would be the number of objects on the system, what would 90 percent of how many extents we should expect to see on an allocation be — this kind of information.

Is it changing daily, when the system is under this kind of load and under that kind of load? So this information could be collected locally; then build nice histograms, and then — not the histogram itself — just aggregate the data, or correlate the data, and send it back.
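A sketch of that local-aggregation idea — daily histograms merged on the customer side, with only derived figures such as percentiles sent back; the bucket semantics are hypothetical:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Daily histograms (objects per system, extents per allocation, ...) are
// merged on the customer side, and only derived figures such as percentiles
// are sent once a month -- never the raw data.
struct LocalHistogram {
  std::vector<uint64_t> counts;  // counts[i] = samples that fell in bucket i

  void merge(const LocalHistogram& daily) {
    if (counts.size() < daily.counts.size())
      counts.resize(daily.counts.size(), 0);
    for (std::size_t i = 0; i < daily.counts.size(); ++i)
      counts[i] += daily.counts[i];
  }

  // Bucket index containing the q-th percentile, e.g. percentile(90) answers
  // "90% of allocations have at most this many extents".
  std::size_t percentile(double q) const {
    uint64_t total = 0, seen = 0;
    for (auto c : counts) total += c;
    const double target = total * q / 100.0;
    for (std::size_t i = 0; i < counts.size(); ++i) {
      seen += counts[i];
      if (seen >= target) return i;
    }
    return counts.empty() ? 0 : counts.size() - 1;
  }
};
```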
J
There's an endless amount of data that we could benefit from. And we could also have developers, when they have questions, say: could you tell me how we're doing in this case — what's the use case at the customer? Then we propagate this question into the field and get the information back from the customer.

A
The tricky thing with that, I think, right, is the concern about potentially identifying information.

J
We don't care about the data — we never want to see the data; we just care about the layout. I mean, as a customer, what do you have to hide about your cache hit ratio, or the number of objects in your system, or — if I build a histogram of your allocations — what kind of fragmentation you have? I don't see any of this as a secret.

A
...from that standpoint — but maybe they don't want people to know that they're working on really large data sets; maybe they have some kind of, like, really secret AI project they're working on, and they don't want people to know that this is what they're doing. I know it's kind of contrived, but, you know, maybe, if you're really determined, you could determine...

H
...you know, confidential information.

J
In previous work I was at a company whose customers were all Fortune 500 — banks, investment houses; you'd have Goldman Sachs, Fidelity, Bloomberg, Deutsche Bank and the like — and they were extremely happy to share this information, because by sharing it they influence the next generation of the product. If they're doing something unusual and we never thought about that scenario, we're never going to build the code to address that issue.

...how much you could gain from hiding this information. And again — the only people who didn't share information were government agencies, like NSA, CIA, FBI and the like, which never give you any access. But I don't think the problem was this kind of information; they just didn't want to give you an API to access the system, because you might do something dangerous to them.
C
Yeah — and keep in mind that this telemetry is all opt-in. So I think the idea with this performance data collection would be that it would be a new channel of information that people would have to explicitly agree to sharing, and it would all be anonymized. In terms of, like, contacting users if we see something crazy going on — I think you talked in the past about maybe adding some kind of, like, optional email address people could associate with their telemetry reports?

I
Yeah — right now that's already implemented. We were thinking about giving further options, to send, like, automatic feedback to the user — that's still not implemented. But if users opted in to identify themselves, which they currently can, and we have constructive feedback to share with them, then we can contact them already now. That's a very good point, Josh — and also Gabi; I agree with both of you, and we somehow need to find the sweet spot in this privacy chaos: to figure out how we can help the user, on the one hand, without compromising any of their privacy. I'm curious how other open-source projects are doing it, if at all — does anyone have any knowledge about it?
A
I used to work for a supercomputing center, and we had Lustre storage, which is kind of vaguely similar to Ceph, but more pointed towards large HPC clusters running scientific workloads. This was like 10 or 15 years ago, so it's been a long time, but they did something like that, right — they had, like, a thing that could run in the background that would just scrape a bunch of data off the system and send it to them.

We used it — we let them do it for ours — because we didn't want to have to manually do that every time there was a problem in the system (because it was so unstable); we wanted them to just have it, so they could then, like, look at it and see what was wrong. So they did something like this, sort of — it was much less sophisticated.

It literally just, like, scanned through a bunch of system stuff, created a tgz file, and then, like, sent it to them.

I
But was that on an opt-in basis, and was it somehow available online, even in an aggregated form?

A
It was not — it was only available to them. I don't know if it was opt-in or opt-out, but you had the choice.

And it only worked if your cluster was available to them over the internet. So in this case I think we had set up special rules so that they could actually come in and grab it — or maybe we pushed it to them, or whatever; I don't remember how it worked. But if you had, like, a totally isolated storage network, it didn't work.

I
Yeah — well, for us too, if it's totally isolated, it will not work; but there is an option for defining a proxy and sending the data. I just don't know whether users will go that extra mile for that.

But if we want to have the relation between the cluster and the performance data — which, of course, we do — as opposed to the relation between the cluster and the device health data, where we currently do not have any relation between them, we need to — yeah, we need to be extra careful with that, because of the option for the users to identify themselves and so on.
A
Is there any distinction right now between what a customer might be okay with us having, versus what they're okay with being displayed publicly?

I
Well, they can opt in to channels, so they can decide that they are okay only with opting in to the device health metrics channel. Then they will just turn on this specific channel, and they will not send any other cluster data — crash data and some other basic cluster data.

But then, once we have the data, we can publish it aggregated, or we can also publish it as-is, like we're going to do soon with the device health metrics. But we're not going to do it with a customer — at least not for now.

C
That's just — that means existing data that we already have perf counters for; we're not actually reporting that in telemetry yet.

I
So is this exposed via the manager as well?

C
Yeah, the manager gets reports of the perf counters — at least a subset of them, already.

We might want to do even more aggregation, especially since telemetry is only doing this once a day — it can gather much more detailed information than the manager needs.

I
Yeah — but, again, everything is anonymized. Of course you can try — if you have some knowledge, maybe you can try to understand what the cluster is, like who the user is — but we really do everything to anonymize it.

A
The scenario I'm thinking about is a poor low-level sysadmin who had been publishing data for three years, because everyone was okay with it; they get a new manager who all of a sudden says that this is all horribly wrong, and you can't do this, and you need to go back and delete all the previous data — and then they're trying desperately to, you know, meet their new manager's wishes.
K
Yeah — and it would be good, for GDPR, just to be compliant, to allow deletion of data.

I
That's in the agreement that we use — there might be a paragraph there that says they're eligible to ask to delete the entire data set that was already sent. But, honestly, I don't remember — I'll need to go over that again.

A
Oh, I was just saying — maybe we don't even need to. Maybe it's good enough to say: hey, you agreed to this, you sent the data; you know we can do our best to accommodate your request, but, you know, you already agreed to the license.

I
Yeah — we want everyone to be happy, so we don't want to cause any frustration to any side. I'm very optimistic about what Gabi is saying, so that's good, but we just have to be prepared for a worst-case scenario when dealing with data — we have to be really careful.

I had a quick question about the collection of the data — are we relying on Prometheus for this performance data, or...?

C
It's entirely based on the performance counters that Ceph uses internally.

I
Okay — okay, cool, yeah. Because otherwise, if you rely on Prometheus — it's not always available, so, I mean...

It's shipped with Ceph, but sometimes there are more issues.

A
Sorry about that — if we're only collecting data over a 24-hour period, it would be useful to know the standard deviation.

I
You mean on the client side — to collect that on the client side to begin with?
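Standard deviation is cheap to accumulate on the client side without keeping raw samples — a sketch using Welford's online algorithm (illustrative, not existing Ceph code):

```cpp
#include <cmath>
#include <cstdint>

// Mean and standard deviation accumulated in O(1) memory with Welford's
// online algorithm, so a once-a-day telemetry report can carry both without
// ever storing the raw samples.
struct RunningStats {
  uint64_t n = 0;
  double mean = 0.0;
  double m2 = 0.0;  // running sum of squared deviations from the mean

  void add(double x) {
    ++n;
    const double delta = x - mean;
    mean += delta / n;
    m2 += delta * (x - mean);
  }
  double stddev() const { return n > 1 ? std::sqrt(m2 / (n - 1)) : 0.0; }
};
```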
D
Yes, I think — Mark, you're right: even our current performance counters, if we were to export them through telemetry, we should rethink what we are actually capturing, because in their current form they might be useless.

C
If you have a histogram performance counter type that you don't have on by default today, because it would cause some performance impact when you turn it on — I was experimenting yesterday with just trying to use more sampling with it: so, instead of adding every single IO to that histogram, say, add every 100th IO, or every 1000th IO.
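A sketch of that sampling idea — wrapping an expensive histogram so only one in N IOs pays the update cost; the names are hypothetical:

```cpp
#include <cstdint>

// Wrap an expensive histogram so that only one in every N IOs pays the
// update cost, letting the counter stay enabled by default. Counts can be
// scaled back up by N when the report is assembled.
template <typename Histogram>
struct SampledHistogram {
  Histogram hist;
  uint64_t every_n;   // e.g. 100 or 1000
  uint64_t seen = 0;

  explicit SampledHistogram(uint64_t n) : every_n(n) {}

  void record(uint64_t value) {
    if (++seen % every_n == 0)
      hist.record(value);  // only every Nth sample touches the histogram
  }
};
```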
I
Yeah, I agree with Adam — and Neha mentioned that also earlier. I think the key question is the what and the why: we really need to understand why we need the data and what questions it answers, and then we'll be a lot wiser about knowing what to collect and how.

G
There is something tricky about the idea that first we need to know what the questions are that we want to answer, and then get the data — because I think it's also the other way around: the more data we have, the more questions we can answer. We could think of new questions that we didn't even think we were going to ask, and if we have the data, we could get more answers. That relates to what I said at the beginning — like, you know, bin-packing of pods onto nodes.

I would try to think outside the box about the questions — not just take the two questions that I have in my mind now and say, okay, let's put in enough information to answer these questions. Because, basically, the things that could help us improve the system over time come in various ways, and maybe some of them we don't recognize at this point in time, or they are very low priority, or we don't think about them — and they change over time, again.

The example that I gave is basically driven from work on public cloud, which was not really an item for us two years ago, but now it is more important for us. So I would try, at least, not to limit ourselves to first positing the question and then doing this, but to try to think in a bit more global way.
F
Yeah, that's absolutely right — I mean, more is always better, right. But, you know, from my perspective: if I were a user, and you tell me that you're collecting my performance data — what is my motivation, what am I going to get out of it? That's what I'm thinking; I'm just putting myself in the shoes of a user versus a developer: I'm willing to share my data because I'm going to get X out of it, or Y out of it.

So that's where I was coming from — but, yeah, you're absolutely right: if you get more data points — if you get more data — you don't know which one is going to be useful when. So we can figure out what that basic set should look like. And the other thing that I also want to add: you know, Gabi mentioned that a lot of, you know, Fortune 500 companies are willing to share their workload profiles and performance data and stuff like that. It might also be worth checking — you know, just sending an email out on the user list and the developer list — like, you know, what is the sentiment there? Are people willing to do this, for, you know, the effort that we want to put into optimizing Ceph for their benefit? These are things to start asking.

G
Agreed, Neha — but I think you've got the answer to your first question, what the users get from this: they get a system which better fits their pattern. So the main thing that they get by sharing the data is that this data will be used to improve the system. I think that maybe it's not symmetrical — maybe, you know, for the biggest companies it's easier, because they know they get higher priority in what we put there, and the smaller customers think we'll never do anything.

But the thing is: if you share your data, then the system will be optimized for your workload. But I totally agree with what you said about the email — we need to explain what they are going to get out of it, and ask them whether, under these conditions, they would be willing to share information, anonymously, and see what it is. But I think we can explain what the...
F
You know, market us better — or, like, market this whole idea a little better — to, like, get people on board, and start with something. You know, we've discussed a lot of things, but I think we just need the basic skeleton — like, even start capturing some sort of performance data, and then we can keep, you know, increasing that set and optimizing for what we really need. But we need people to hit that opt-in button, and for that we need to market this better — that's my point.

D
I mean, that's a very valid question, because usually, if such a thing happens, we can expect even a media shitstorm — that we're collecting unwanted customer data. There doesn't have to be any merit in it, but it always sounds good in some angry propaganda story. And do we even know how we'd handle that? Because it will happen — I'm, like, afraid it might happen.

C
Yeah — you definitely want to be very clear and very explicit about what we're collecting, and I agree about trying to inform people about this earlier, and gauging, like, the level of interest and whether this will be effective at all. Yeah — I think we certainly do see lots of folks, even on the mailing list, who are very happy to share with us all kinds of details about their clusters; but that's only a limited sample size, of course.

G
You know, if I were a product manager, I would suggest that we maybe open it up and have several levels — you know, basic or more advanced levels of things. So from the basic level we could get some information that helps us; and if you send more information, maybe we could actually get enough information to tell us how to optimize your system — so, something that you could set up, and based on this information we could do, you know, great things with it, if we think about it in a creative manner. By the way, the reason I say that: if you have several levels of things, it makes it easier for people to push the opt-in at the lower level, because it makes clear that you're giving away a small enough amount of information that nothing could reveal too much — because there are more levels above it. So it usually helps: if you have, you know, basic, advanced and super, people push the basic one — it's easier for them.
A
It's not necessarily — I don't necessarily care if they're running etcd, but I might care about whether or not they're trying to, like, have logs on top of RBD.

C
Yeah, that's true. I was just thinking of things like — if you want to say: is someone using RGW for, like, a machine-learning workload, or just throwing a bunch of logs at it, or some other large data set — it's much more difficult to understand that automatically, true.

I
Sorry — the reason you want to know what they're doing with the clusters is in order to better understand the patterns of IO; is that the underlying reason?

A
Or what we should target, right — like, what are our performance optimization targets, what do we care about? If, like, tons and tons of people are trying to do, like, write-ahead logs on top of RBD — for whatever it is: databases, etcd, whatever — then maybe that means it's, like, a use case we really need to focus on. Like, maybe we need to figure out how to make this fast, because we're not really good at it.

Or we're okay — but that's what everyone's trying to do; maybe that means that's what we need to be focusing on.

I
Yeah — we should always keep in mind that, at least for now, the users who opt in to telemetry are the more savvy users, so it doesn't necessarily represent all use cases. That's also something to keep in mind.

J
One more point, back to Neha's question about what the users can benefit: maybe we could also gather some stats information about performance and build some expectations — what the IOPS could be, what the cache hit rate should be, and so on — and when customers send us information, we might even do some automatic data mining and go back to them and say, you know what... — for that we'd need an email.

F
Yes, that's exactly what I was trying to say when I said we want to market this better. We want to give them some, you know, benefit — like: if you opt into this, this is what we give you, right? And that's when we are going to get more realistic data, and, you know, even create those workload profiles that Mark is thinking about.

C
Well, I think there's been a lot of quite good discussion, and we're kind of over time now. Anything else folks want to bring up in the last minute here, or shall we close out?
C
Yeah — so I think there are a few things, but the first thing is probably identifying, like, the minimal set of things we want to start capturing, and I'll start working on that and creating that telemetry. And, secondly, I think the idea of reaching out to users more on the mailing list and figuring out whether there's an appetite for this — what willingness there is to share this kind of information.

A
How about some kind of, like, beta opt-in, where they agree to just share whatever in the cluster — like, we don't make any promises about, like, anything — and if there are people that are willing to do that, then we can start just getting, like, whatever data we want to try collecting; we can experiment with it.

I
I think that would be a bit problematic with the current model of how we collect telemetry — yeah, we always let them know in advance what we're collecting. However, there is an option to generate a sample report prior to sending; but I doubt that operators will check daily — or, sorry, after an upgrade — how we collect, or what we collect now, if we changed anything, in case we haven't put anything in the notes.

E
It makes me nervous, because — I mean, there are probably some people who would say yes; we're, like, asking them to trust us, and I'm sure there are people who do. But I think, for all the other people, just the fact that we're asking it would, I think, make them trust us less. I think we would have a net...

There's this — what is it — in, like, my Google News feed this thing keeps popping up about — I think it's, like, Audacity; there's some, like, audio editing program, and they, like, have a telemetry feature that's also opt-in, but there's apparently this, like, backlash in the user base about it. We haven't seen anything like that, but I'm just conscious that I think we need to be really...

But, I mean, I think we can be a little bit more deliberate about what we want to collect, and we can put it in a separate channel and, like, say: like, a whole bunch of performance stuff — we don't think that it's identifying, but you could imagine in some circumstances it might be, and it'll be very interesting for us to learn about how people are using it. And we can leave that channel off by default.

Yeah — I mean, I think it's just that, like, almost nobody's going to turn it on. I think that probably makes sense if, like, there's, like, a specific question you're trying to answer, and you, like, reach out to a bunch of specific users and say: can you guys please turn this on, so I can sample you guys — or whatever.

But, I mean, the thing there — the challenge there — is that, like, their biggest clusters — they don't upgrade them aggressively; like, they upgrade all their smaller clusters first, until they're, like, confident. They might even still be on Nautilus for their biggest cluster — not sure if they've talked to us yet — which means it's, like, a two-year turnaround, or a year-and-a-half turnaround, before we actually get the information back. So it's...

I don't — I guess I'm not sure telemetry is the place to experiment like this. If there's something specific that we want, then, like, we could, like — right — help them script up a custom data collector or something, like, specific.
D
Is the data sent in a telemetry report easily readable by a human? I mean, can we build the confidence that at least most of the data that we send is not some secret stuff — just uninteresting?

E
It's JSON, and it's, like — it's pretty intuitively formatted. But, I mean, there are also lots of numbers — so, I mean, you could imagine that, like, there's some steganographic, like, secret stuff that we're putting in there. But you can also look at the source code, and it's not doing anything crazy there. So — or...

G
Something you could do — you know, with JSON or anything else — is make a label for each number, or put more labels. It would just be some more data over the wire; but where it would otherwise be, like, 15 numbers that someone could think convey something, just output a label for each number, and people will understand it.

A
One of the things we talked about is: what should we do when someone regrets their decision to opt in?

Maybe it would at least help alleviate some of that — the issue you mentioned with Audacity — if they can immediately just purge it.

I
Wait — we already passed a thousand, but we went back, so, yeah: it was already 2010, but we didn't stay there. But yeah.

A
...like to maybe, through labels or some other mechanism, say that this cluster fits into this category — whether it's being used for RGW or RBD, or whether it's, you know, composed of SSDs or hard drives — or, you know, some way to do kind of, like, clustering of this data.

I
So if you find anything specific that you would love to see, please let me know, because there's a lot of data, and we've been putting up all these panels according to what was necessary and what we thought was necessary — but there's a lot more that can be done. So, if there's anything specific that you want to see, please let me know and I'll add it.

A
What would be really cool is if you could take the data set and then do, like, cluster analysis on it — like data-mining cluster analysis, where you can say: here's a whole list of attributes, and this is how all these different things fit into them. So maybe you don't even know what question to ask, but all of a sudden there are interesting trends that show up when you do this.

I
Yeah, yeah — again, we need to talk about exactly what the questions are. By the way, we can tell that a cluster has rotational or flash disks, but we cannot tell more than that — you mentioned devices specifically; as I said, the disk information is totally separate, so there is only so much we can answer on that field. But there is a huge metadata table for each one of the clusters, which is time-series, so questions can be answered.

It's Python, and the database is Postgres, so there's a lot of queries there — lots of SQL and...