From YouTube: Ceph Performance Meeting 2022-02-03
A
So, all right: there was very little in pull requests this week. I didn't see any new or closed PRs, which is not terribly surprising given that everyone is very focused on Quincy. There was one that was updated that I saw: the "set tracing compiled in by default" PR that folks on the RBD side have been looking at. I don't think we have Deepika this week, but she reviewed it, and it's gotten a couple of different updates this week.

A
So it looks like it's still under active development; we'll see what the outcome of that is. Coincidentally, one of the PRs that appears to maybe have caused a performance regression a while back that we didn't notice was also related to tracing, so keeping a good eye on how much of a performance regression this PR causes, even when disabled, is probably going to be important.

A
Beyond that, I did not see anything else really going on on the performance front this week for new PRs.
B
Oh, go ahead. Absolutely! Yes, and I have something that I'm waiting on. You don't see it yet because I'm waiting due to the work on Quincy, and I'm not sure it's exactly performance work. There is a PR I'm going to do about the balancer, for the very rare cases in which the existing balancer, the calc_pg_upmaps code, gets really stuck. It's not an infinite loop, but it keeps working; I have an example where it works for more than 10 minutes.

B
Actually, it's really one function call: it goes into some huge calculation. I have a very simple fix that reduces this to less than 20 seconds, with very, very small code changes and only minor changes in the results.

B
It changes the result of the balancer a bit. Probably when the Quincy feature freeze quiets down, I'm going to push this PR. It's maybe 12 lines of code in calc_pg_upmaps, but it at least fixes the examples that I have of this huge performance issue.

B
Again, I don't think it's an infinite loop; I think if I gave it enough time it would finish. But the fix solves all the use cases, all the examples actually, that I have for this.

B
It's not the PG autoscaler, it's the balancer, the calc_pg_upmaps calculation. Okay, so there is one case I have, actually. I have dozens of different configuration files from different systems, with pools which are worth balancing because they are large enough. I don't have the exact statistics, but something like 30 large pools are worth balancing, because all kinds of small pools are not really interesting, and out of them...

B
I have one pool, with one configuration, that triggers it. In a matrix of multiple parameters I see six examples, all on the same pool, where the computation doesn't stop. So I give it a timeout, and the largest that I gave is 600 seconds, 10 minutes, and it still doesn't complete, and after my fix it completes in less than 20 seconds.
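A rough way to reproduce this kind of measurement offline, without touching a live cluster, is to feed a captured OSDMap to osdmaptool, whose --upmap mode exercises the same calc_pg_upmaps code the balancer uses. This is only a sketch under assumptions: the map is assumed to have been saved with `ceph osd getmap -o osdmap.bin`, "bigpool" is a placeholder pool name, and the 600-second cap simply mirrors the timeout mentioned above.

```python
# Sketch: time the upmap calculation for one pool against a saved OSDMap.
# Assumptions: `ceph osd getmap -o osdmap.bin` was run beforehand, osdmaptool
# is installed, and "bigpool" is a hypothetical pool name.
import subprocess
import time

MAP = "osdmap.bin"
POOL = "bigpool"
TIMEOUT = 600  # the 10-minute cap mentioned in the meeting

start = time.perf_counter()
try:
    # osdmaptool --upmap runs OSDMap::calc_pg_upmaps and writes the resulting
    # `ceph osd pg-upmap-items` commands to a file without applying them.
    subprocess.run(
        ["osdmaptool", MAP, "--upmap", "upmap-cmds.txt", "--upmap-pool", POOL],
        check=True,
        timeout=TIMEOUT,
    )
    print(f"calc_pg_upmaps finished in {time.perf_counter() - start:.1f}s")
except subprocess.TimeoutExpired:
    print(f"calc_pg_upmaps still running after {TIMEOUT}s")
```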
A
Oh, excellent. Did you say there is a PR yet for that, or is it still just in a branch?

B

A
You can submit a PR and just tag it with "do not merge" or something if you're not ready to merge yet, just to get more eyes on it. Okay, okay, cool, excellent. So were there any other PRs this week that were new or closed or updated that I missed, guys?

C

A
Did you put it in the pad or the chat window, or...?
A

D
Yeah, so we have been testing a workload in another team, a DFG Quincy test, and Mark, remember I talked with you; it's the same tracker. So let me provide some background to this one. What we did was take Quincy and deploy it on a 192-HDD cluster. It's hybrid: the OSDs' DB is on flash, NVMe, and the data disks are HDD, and we run a fill workload first.

D
We fill the cluster, and it's small objects: the objects range from 1 KB to 256 KB. The histogram is already provided in the tracker. It's a COSBench workload, S3, a completely RGW workflow, and we have 30 COSBench drivers, which means around 160 workers; we were running around 2100, but we slowly reduced it. We were thinking the client was putting on a lot of load, so we reduced this to 1680, we disabled the autoscaler, we pre-sharded the buckets; nothing helped.

D
So what is happening is that reducing the count of the COSBench workers helped us to pass the one-hour hybrid workload, but not the 48-hour aging. The fill is going fine, though it's not adding all 50 million objects to each bucket; we write around 50 million objects per bucket and a few objects are missing from each bucket, but that's still okay, because it's just a fill, not a write. In COSBench terms, for a fill the client doesn't wait for a write to be successful.

D
It just keeps writing data to the bucket, and after that we do a hybrid workload, which is a one-hour hybrid, followed by a 48-hour hybrid. Whatever we have tried, the 48-hour hybrid was not successful at any point.
A

D
And the cluster health has always been fine: no saturation or anything, no OSDs dying or flapping. Nothing.

A
Have you tried, after it gets to the point where it stalls, sending any new I/O to the cluster via other mechanisms?

D
No, no, no. Like a command-line put bucket or create bucket, something like that? Right.

A
Casey, do you have any sense of whether anything in RGW might not be responding, or do you think this might be lower down in the OSD?

D

D

C
Okay, yeah. I assigned Mark Kogan to this, so you might reach out to him and offer assistance, but he's also been tracking some memory growth in RGW, so it'd be interesting to see if we see the same thing in this workload.

A
Thanks, yeah, that's good to know. Lots of changes in RGW, but also some changes in the OSD. I have not seen anything like this on the tests that I've run, but I have not been running three-to-four-hour, long-running RGW stress tests, so I would not have seen this.
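One way to act on the question of whether the stall is in RGW or lower down, sketched here with hypothetical daemon names, is to compare what an OSD thinks is in flight with what the RGW's internal RADOS client is still waiting on. Both are admin-socket queries, so they have to run on the host where the daemon lives, and the field names are the usual JSON output of those commands.

```python
# Sketch: check whether ops are stuck at the OSD level or only inside RGW.
# Assumptions: run on the daemon's host; "client.rgw.gateway1" and "osd.0"
# are placeholder daemon names.
import json
import subprocess

def asok(daemon, *cmd):
    out = subprocess.run(["ceph", "daemon", daemon, *cmd],
                         check=True, capture_output=True, text=True).stdout
    return json.loads(out)

# Requests the RGW's internal RADOS client has sent but not yet completed.
rgw_reqs = asok("client.rgw.gateway1", "objecter_requests")
print("rgw outstanding rados ops:", len(rgw_reqs.get("ops", [])))

# Ops currently being processed by one OSD; long-lived entries here point
# below RGW, while an empty list with stuck clients points at RGW itself.
osd_ops = asok("osd.0", "dump_ops_in_flight")
print("osd.0 ops in flight:", osd_ops.get("num_ops"))
```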
A
All right, well, yeah. Thank you guys, thanks for looking at this. Anything else?
B

A

B
One point regarding the discussion that we had last week about RocksDB: we talked to the people from Speedb. They also saw the recording, and they are committed to opening the source. We talked about this; they are willing to be here next week if we put it on the agenda, and let's see how they can discuss with the people from DigitalOcean.

A
Whether... I'm not sure if next week is going to work. Intel is going to be giving a presentation on Open CAS and their work comparing it to dm-cache, either next week or the following week. I don't know for sure; I gave them the option of either and they're going to try to figure it out, so we might be able to do it next week, but that's still a maybe.

B
So keep me in the picture and I'll also update the people from Speedb, and let's plan for next week and the week after; we'll do both. Let's see which one Intel decides on, and on the other one they will bring the people from Speedb. I'll talk with them and make sure that they know we may change it, because we have a prior commitment.
A
Yeah, I do want to point out, too, that we're happy to talk to them, happy to bring them in to present anything that they'd like to talk to us about. But until it's open source, and especially given that BlueStore is, you know, very quickly heading towards more of a stable...

A
...implementation while we develop Crimson, major changes like replacing RocksDB are, you know, definitely...

A

B

B

A
So, Adam, you weren't here last week, but the folks from DigitalOcean raised some issues that they were seeing regarding RocksDB, and I think Josh was thinking that maybe the Speedb folks had some insights or thoughts regarding that. Josh, do you want to talk a little bit about what you were thinking? And I know Adam has already looked at Speedb a little bit, so maybe it'd be a good discussion to foster.
E

B
Yeah, and from my side I just did pattern matching between what the people from DigitalOcean said and what we heard from Speedb in the past. I didn't fully understand their problem, but it was clear to me that they had a problem with tombstones and with the delete process within RocksDB, and I know that Speedb claims they improved that significantly.

B
They need to prove it, and that's why this whole thing started. I'm not sure that the problem is mainly in Speedb, or... I didn't even have a feeling for how confident the people from DigitalOcean were that the problem is only in Speedb, sorry, only in RocksDB, and not in other places. But if it is there, maybe we have a solution. If it's not there...

F
It sounds like we are planning to invite them. Why don't we get the DigitalOcean folks into a future meeting on the same forum as well, and I'd also like to hear their plans about open sourcing Speedb. I think that'll...

G

B
Yeah, they committed. They already told us that they committed to a customer, I don't know which one, that they're going to open source it within three months, so they're in the process of open sourcing it regardless of us.

B
We have the problem now, and I think it's good to see... Josh.

B
I think that if we have a solution for DigitalOcean... they came with a specific problem that they claim is with RocksDB. If you don't have a solution for them, you leave them with the option of waiting while Speedb works on opening the source; you wait until it's published in order to start tackling a problem that exists now for an important user.
F
That's the biggest problem, right? I mean, let us assume that Speedb solves the problem. How do they even consume Speedb at this moment, unless it's open source? That's where the problem lies. If they have open sourced it and they come to this meeting and talk about how they have done it, they present a solution that somebody else can try.

F
That would be a useful thing to do, but if there is something that is far-fetched and we make promises... I want to get to a resolution, and again, like Mark said, we don't want to invest too much in BlueStore, given that SeaStore is the future. So I think it'll be better to talk when it is open source, and then we can have somebody try their solution.

A
Adam, the gist of it is that when they went and were regularly compacting in the background on a schedule, and it gets deeper than this, but at a very high level, when they were regularly compacting, it dramatically improved performance for them in some situations, rather than just letting compaction happen on write, essentially.
B
Josh, Josh here is from DigitalOcean.

H
No, no, I've been enjoying the conversation. Yeah, so the issue, essentially, is that tombstone and overwrite build-up causes such a degradation in list performance that we start to have significant index performance issues, leading to even OSDs going down from time to time.

A
I don't remember... oh, go ahead. Yes, sorry! No, no, you go ahead. I was just going to ask Adam: do you remember, when you were testing Speedb, did you look at iterator performance or anything else where tombstones were having, you know, a big effect on RocksDB, as in previous things that we've seen?
E

A
I don't remember. When we talked to the Speedb folks before, I wasn't super involved beyond the very beginning, but I don't remember them saying that they had a solution for that problem. I think I even brought it up specifically that we wanted to be able to reduce the impact of tombstones on iteration performance. I didn't think they said that they did that better, but I could be wrong. That was just my vague, probably poor, recollection of that conversation.

A
Back on this topic: yep, yep, okay, so let's move on. I will just very briefly talk about Quincy performance testing. In the etherpad there's a link to a spreadsheet that folks can take a look at if they want to, but the high level of it is that, compared to previous releases, read performance is very much in line.
A

A
However, on these AMD nodes that we just got this year, so we haven't been testing them very long, there's kind of a clear difference going back from Nautilus all the way to Quincy. In some cases we're seeing an almost progressive degradation of performance in these tests. In other cases, like at the very bottom, if you look at column Q, row 93, you'll see a chart for 4 KB random writes, and there's this kind of situation where Nautilus was really good, then from Octopus through Pacific we weren't doing as well, and then in Quincy we kind of clawed it back.

A
So I went back and have been doing bisects all week, trying to figure out what happened there. The reason Quincy is looking good compared to Pacific is due to Gabi's excellent PR changing the allocator behavior and getting a bunch of stuff out of RocksDB. That's what's giving us that win, and we're actually faster than Nautilus, which is good, but we could be faster yet. I think if we go back and look at Nautilus, what happened is that initially in Nautilus we did not see that good performance.

A

A
So that really helped, but it turns out that change was backported to Nautilus. When we implemented it in the pre-Octopus time frame, it turns out that right around the same time we made that change, we also introduced a number of PRs that were actually hurting write performance. It's a little hard to tease all of them out, but the one right now that's standing out is this PR where we introduced changes to tracepoints.

A
This is PR 29674, which I will link here. Specifically, inside that PR there are something like 10 commits, and it appears to be one of the two commits I've got listed in the etherpad that are doing it. That was kind of the biggest, most straightforward regression that I saw. I think there are others, but unfortunately, you know, we're talking...

A
...maybe a percent or two at a time. So the 4K min_alloc_size change was a big win, but then we also had a bunch of smaller PRs that were regressions. This is exactly the kind of situation that's very hard to tease out, especially when they happen close to each other in time.

A
So the good news was that, because we didn't backport a lot of those other regressions to Nautilus, we only saw the big win there, and that's why it really stuck out in the graphs or charts that I showed. My hope is that we can win back some of that, and we'll even see Quincy doing a little better in these write tests than it's doing now.

A
You know, either getting back to Nautilus in some of the larger write tests, or being an even bigger win in the small random write tests where Gabi's PR is providing even better wins. So that's what I've got on those tests; any comments or questions there?
A
Okay, if not, then...

F
One quick question before we move on: have you tested any RGW workloads yet?

A
I have. I didn't provide them here because I haven't graphed out the results. I've been so focused on these RBD results that I wanted to see if I could identify the regressions and see if there are any easy wins there, and then go back and retest RGW. RGW is more complicated because we also saw some regressions in RGW itself.

A
You know, that has big performance impacts in RGW right now. In RGW we know that back in the Pacific time frame there were a couple of regressions that were introduced. It's probably going to be harder to tease out whether those are OSD or RGW regressions specifically, but we do have some results for it. I probably won't dig them out for this video, but...

F
That's fine, that's fine! I was just curious because of the stuff we started off this meeting with, the Quincy issue that Vikkad was talking about; I was just wondering if you've seen something similar or not. But when you have the results, let's talk about it then. Let's move to the next topic.
A

F

F
The tracker? No, no, it's good stuff, it's good stuff. So, in general, I think folks already know: Gibba is a scale cluster, mostly a logical scale. You've got close to 1000 OSDs running with very limited resources, especially memory, and Mark has been suggesting and even tuning the cluster to behave, or at least hold up.

F
I would say not behave, but hold up, in such conditions. So one of the things that we wanted to do was to use this cluster to run some kinds of workloads across the board, and Mark, you've already installed a bunch of stuff that can help us run CBT on this, right?

A
Yeah, CBT is close to working there; it's not a problem. The only thing you might need to do: David had suggested that the way he wants people to handle SSH keys here would be to use the forwarding capability, and to do that with pdsh.

A
There needs to be an environment variable set. It shouldn't be hard to make CBT work that way. I'm not sure if we can do it easily at the moment, but a quick PR would take care of that. So we may just need to make a slight change to CBT for that to work, but otherwise I don't see any problem with running CBT workloads on this; it's at least straightforward.
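A minimal sketch of the environment-variable approach being described, under the assumption that CBT drives the remote nodes through pdsh's ssh module and that SSH agent forwarding is the capability David wants enabled. PDSH_SSH_ARGS_APPEND is pdsh's standard hook for passing extra ssh options; the node list and YAML file below are placeholders.

```python
# Sketch: make every pdsh/ssh hop that CBT performs request agent forwarding.
# Assumption: CBT shells out to pdsh with the ssh rcmd module, which honors
# PDSH_SSH_ARGS_APPEND for extra ssh flags.
import os
import subprocess

env = dict(os.environ)
env["PDSH_RCMD_TYPE"] = "ssh"
env["PDSH_SSH_ARGS_APPEND"] = "-A"   # forward the ssh agent through each hop

# Quick smoke test, then hand the same environment to cbt.py (paths are placeholders).
subprocess.run(["pdsh", "-w", "node[001-016]", "hostname"], env=env, check=True)
subprocess.run(["./cbt.py", "--archive", "/tmp/results", "mytest.yaml"],
               env=env, check=True)
```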
F
What
I
would
like
for
us
to
do
is,
I
think,
ashwarya
and
sridhar
they're
both
on
this
call
they've
been
actively
working
on
qos
background
qos
for
quincy,
and
this
will
be
an
opportunity
for
them
to
test
some
of
the
qs
related
settings
and
parameters
that
we
have
applied
for
quincy,
especially
around
how
this,
how
q,
how
m
clock
behaves
with
scrubbing
recovery,
while
io
is
running
on
this
cluster
at
this
scale,
what
they
have
done
so
far
is
tested
at
smaller
scale
with
ssds,
and
now
there
is
a
test
plan
that
we've
created
kind
of
to
replicate.
F
J
Okay,
I
can
go
ahead,
so
what
we've
been
working
on
is
currently
the
background
tasks
like
recovery
and
scrubbing
and
seeing
how
m
clock
handles
it
with
client
and
how
the
different
amp
clock
profiles,
work
and
we've
been
doing
this
on
the
official
analysis
nodes,
but
it
would
be
great
to
do
it
with
thousand
osds,
and
our
tests
currently
run
client
and
recovery
together
and
there's
a
new
test
coming
in
that
runs,
client
recovery
and
scrub
together
so
and
we
collect
some
stats
on
recovery
and
scrub
or
from
pg
dump.
J
So
we
would
really
like
to
see
it
on
a
larger
scale.
That's
basically
what
we
want
to
do
with
the
gebar
cluster.
F
So
I
think,
there's
a
test
plan
that
has
been
linked
here,
but
it's
a
private
document,
so
I
might
need
to
convert
that
into
an
ethernet.
F
You
can
do
that
offline,
but
I
guess
yeah.
You
had
something.
K
I
know
I
was,
I
was
just
gonna
say
that
you
broadly
outlined
the
the
test,
steps
for
for
each
of
the
tests
that
we
have
identified,
for
example,
the
recovery
test,
the
the
scrubbing
test
and
the
combination
of
the
scrub
plus
recovery
test.
So
these
are
high
level
steps
that
that
cbt
currently
does
so
once
the
one
cvt
is
up
and
running.
What
we
have
tested
so
far
is
at
a
scale
of
about
1000
objects
in
the
recovery
pool.
K
With
different
with
different
profiles,.
F
Do
you
wonder
for
folks
who
are
not
aware
of
what
these
different
m
clock
profiles,
or
do
you
want
to
quickly
describe
what
they
are
and
what
they're
meant
to
do.
K
Yeah, essentially, in mClock we have defined three profiles. The default profile is called high_client_ops, which gives more reservation, or preference, to the client operations while still giving adequate reservations to background recovery and scrub-related operations. The idea is to get a baseline first on this machine and then switch the profiles around. For example, one of the other profiles we have is called the high_recovery_ops profile, and there's also a balanced profile.

K
So with the high_recovery_ops profile, for example, we could run the same test and see if it actually helps in giving higher preference to recovery ops, while still allowing the client ops to keep a decent balance without getting affected too much. So yeah, these are the three basic profiles that we have defined, and the plan...

L

K
...is to establish the baseline first, then test the different test cases with these profiles and extract the numbers to see how mClock behaves.
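For reference, the profile switching being described maps onto a single OSD option in Quincy, osd_mclock_profile, with the three built-in profile names mentioned above. A sketch of flipping it between test passes follows; run_one_test() is a placeholder for whatever CBT invocation the test plan uses.

```python
# Sketch: run the same recovery/scrub test under each mClock profile.
# Assumption: Quincy's osd_mclock_profile option with the three built-in
# profiles named in the discussion; run_one_test() is a placeholder.
import subprocess

PROFILES = ["high_client_ops", "high_recovery_ops", "balanced"]

def run_one_test(tag):
    # Placeholder: kick off the CBT client + recovery (+ scrub) test pass here.
    print(f"running test pass: {tag}")

for profile in PROFILES:
    subprocess.run(["ceph", "config", "set", "osd", "osd_mclock_profile", profile],
                   check=True)
    run_one_test(profile)
```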
F
So, Mark and everybody else, how does this sound? And if there are any thoughts, ideas, or incorporations that we can make into this, I'd be curious to know what those are.

A

M
No, no, yeah! I'm happy about this, and...

F
I'm sure the scrubbing... this scrub test makes you rather more happy.

M

A
We should... we've got lots of PRs sitting out there that we should get into CBT, but we should definitely get both any kind of scrub testing and the mClock changes in, if there's still anything outstanding. It's neat that you guys have figured out ways to do this.
M
I
have
a
related
question
about
that
just
to
understand.
Currently
there
are
a
lot
of
weights
etc
within
the
crop
code,
some
of
they
are
disabled.
When
the
m
clock
is,
is
the
scheduling
method
meter
that
is
chosen,
but
do
we
envision
a
time
when
we
can
remove
the
need
for
any
any
specific
manual,
scheduling,
manual,
delays
and
just
assume
we
have
mdm
clock.
K
Yes,
I
think
the
the
idea
is
to
remove
all
the
manual
configurations
that
we
add,
for
example,
the
scrub
delays
and
all
that
the
idea
is
to
dis
disable
all
of
them
and
let
m
m
clock
based
on
the
profile.
Let
it
do
the
the
allocations
and
and
let's
scrub
go
as
per
the
setting
that
we
have
so
the
long
term
of
sure
we
we
want
to
eliminate
all
those
scrub
sleeves,
for
example,
that
we
have
defined
and
clock
do
its
work
without
any
of
these
settings.
K
Well,
currently,
the
way
the
way
the
code
and
the
code
things
have
been
made.
Even
if
we
add
those
settings,
the
m
clock
code
overwrite
those
back
to
the
to
zero
yeah.
Well,
basically
yeah
it
disables.
It
essentially.
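A quick way to see the override behaviour being described, assuming the sleep options named below (the long-standing WPQ-era knobs) and that `ceph config show` reports the value a running OSD is actually using rather than only what was set:

```python
# Sketch: confirm that a running OSD reports the WPQ-era sleep knobs as 0
# while the mClock scheduler is active. Option names are the usual
# osd_*_sleep settings; "osd.0" is a placeholder daemon name.
import subprocess

SLEEPS = ["osd_recovery_sleep", "osd_scrub_sleep",
          "osd_snap_trim_sleep", "osd_delete_sleep"]

for opt in SLEEPS:
    val = subprocess.run(["ceph", "config", "show", "osd.0", opt],
                         check=True, capture_output=True, text=True).stdout.strip()
    print(f"{opt} = {val}")
```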
F
So I guess the question is about when that new code lands, right? It's a matter of maturity. This is the first release where mClock is going in as the default. If things look good and we get much more confidence that mClock is doing the right thing and we don't need... you know, these sleeps are all associated with WPQ, which is the old OSD op queue that we used to use.

F
At some point we can just say, okay, we don't care about WPQ, because mClock is far better, and at that point we don't need those sleeps anymore. So it all boils down to that new code. If it is targeted for something, let us say the R release, sure, we probably still need those sleeps implemented and accounted for, just in case...

F
...somebody needs to switch to WPQ for performance reasons. But if you're talking about, you know, further down the line, maybe not. So I guess it's about timelines.

M

F
Anybody else have any other thoughts? I know there are other folks who are doing similar tests at different scales, with different kinds of workloads, and I'm sure there's a lot of learning we can get out of each other's experiments. So I was hoping that this could be our... I don't think CBT has ever been used at this scale, at least to my knowledge. So if we can get CBT to run on a thousand-OSD cluster, maybe that can be the workload generator that we can use very easily across the board.
A
Yeah,
I
think
the
biggest
I've
ever
done
is
probably
four
or
five
hundred
I've
never
done
a
thousand.
I've
used
pdsh
on
a
thousand
notes
before,
but
not
cbt,
so
that'll
be
interesting
to
see
how
it
goes
milestone,
right,
yeah,
yeah
and
it'll
be
easy.
This
is
almost
a
little
bit.
Money
then
doesn't
actually
have
to
deploy
the
osd's
spawning
clients-
let's
basically
just
invoking
pdsh
but
yeah
that'll
be
interesting.
D

F
Yep, that's an interesting point. So we are trying to, you know, spread the load across different groups so that we can get a good idea of how mClock is doing across all kinds of operations, the background operations that the OSD does. What Sridhar and Aishwarya are going to be focusing on is more like client I/O versus recovery and scrubbing, and Vikkad and team are doing some PG deletion performance evaluation, also with mClock.

D
General concurrency, and then PG deletion... and then tell me one thing: what is the default profile? Is it high client I/O or is it balanced?

D
Do we have enough documentation upstream for this feature? I mean, what these profiles are?

F
This is one thing I can say yes to, because every time we have merged an mClock change or mClock PR, there's documentation going with it. While we continue with the meeting, I can paste some of the links in the chat for reference. So maybe, Sridhar, you've written a lot of this documentation, maybe you can...

F
...and for Sridhar to take it on, let me know, and then we can, you know, execute some of the test plan that they have.
A
Sure
sure
that
sounds
good
I
can.
I
can
try
to
get
that
taken
care
of
so
that
they're
not
blocked
it
shrieker
or
aishwarya.
Do
you
do
you
have
a
particular
workload
that
you're
you're
interested
in
fio
or
beta's
bench
or
what?
What
do.
A
That's
right
here,
so
I've
got
fio
installed
on
most
of
the
nodes.
There
were
a
couple
that
were
down
and
one
that
was
stuck
on
on
on
rail,
eight
and
there's
not
really
on
on
centos
eight
instead
of
sent
us
stream,
and
that
was
causing
problems
with
the
young
rippo,
but
other
than
that.
I
think
fio
should
be
on
all
the
nodes.
Now
do
you
have
a
specific
amount
of
clamp
workload
that
you
usually
try
to
invoke
or
is
it
you
know?
K
Generally,
we
we
have
right
now.
The
way
we
have
tested
is
just
using
one
one
client.
So
I
I
guess
we'll
have
to
just
do
some
experiments
to,
like
you
say,
keep
the
cluster
busy
on
one
pole
with
client
tops.
While
we
are
triggering
we
have,
while
we
are
triggering
the
recovery
and
the
scrub
operations
and,
for
example,
some
other
food.
A
Oh
sure,
for
scrub
up
and
recovery,
typically
in
cbt
for
recovery
operations,
I
cpt
at
least
from
where
I
remember
it's
been
a
while
since
I've
done
it,
but
I
think
we
need
to
own
the
osd
to
do
it.
A

K
As part of the testing, we did introduce a new kind of recovery test, so that's the test that we're going to trigger; it's already there in CBT. Actually, there are two types of recovery tests: one that you already had, Mark, and then one that we newly introduced to test with mClock.

A
And for the new test: will that work on an existing cluster, with the use_existing flag? Or, when you test it, are you testing it on a cluster that CBT deployed using the ceph cluster class?

L

A
That might be an interesting thing to see. I don't believe my existing recovery tests work when you use use_existing, because it's not aware of the OSDs. When use_existing is set, the OSDs are basically just assumed to be there, and it doesn't try to touch them. So that is actually an interesting question; that could be a little bit of a wrinkle here, with testing when the cluster is already pre-deployed.
F
What
does
it
do
differently?
I
mean:
what
can
you
describe
for
me?
What
is
it,
what
does
that
recovery
test
do
like?
Let
us
say
if
we
induce
recovery
by
just
bringing
osds
down
manually
so.
A

F

A
Oh yeah, if you brought them down manually, that'd be fine. Like I said, it's been a long time since I've looked at this code and since I've thought about this, but vaguely I remember that things in the ceph cluster class, like the recovery state machine, assume that CBT has knowledge of the topology of the cluster, and that happens when it did the deployment itself: then it knows about the OSDs and knows where they are.

A
It knows how they were deployed. When use_existing is set, that is all basically empty; there's nothing there. It doesn't know anything about how the cluster is deployed; it's just running some benchmarks against it. I don't know that the stuff I wrote for recovery will work when use_existing is set.

A
I don't think it does. I could be wrong on that, and it might be possible to change it so it does, but it is primarily due to the fact that CBT doesn't have any knowledge about how many OSDs there are, how they were deployed, and what it should do. It's possible, though, that it might work, and if it doesn't work, it's possible you might be able to change it.
F
Yeah
yeah,
that's,
I
think,
that's
what
I'm
thinking.
Maybe
that's
something
breathing
for
you.
You
can
verify
in
your
local
setup
how
this
works,
but
even
like
I'm
thinking
in
the
same
direction,
there's
nothing
that
stops
us
from
just
manually,
injecting
failures
and
let
the
the
let
cbt
just
you
know,
collect
the
stats
or
or
even
see
how
long
the
pgs
are
taking
to
recover
and
all
that
kind
of
stuff
that
they
have
added.
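A sketch of the "inject the failure manually and just collect the stats" idea, assuming the standard ceph CLI and the usual pgmap fields in `ceph status --format json`. The OSD id is a placeholder, and the poll loop simply measures wall-clock time until everything is active+clean again.

```python
# Sketch: mark one OSD out, then time how long the cluster takes to get all
# PGs back to active+clean. Assumes `ceph status --format json` exposes
# pgmap.pgs_by_state; OSD id 12 is a placeholder.
import json
import subprocess
import time

def pgs_not_clean():
    out = subprocess.run(["ceph", "status", "--format", "json"],
                         check=True, capture_output=True, text=True).stdout
    states = json.loads(out)["pgmap"]["pgs_by_state"]
    return sum(s["count"] for s in states if s["state_name"] != "active+clean")

subprocess.run(["ceph", "osd", "out", "12"], check=True)   # inject the failure
start = time.perf_counter()
while pgs_not_clean() > 0:
    time.sleep(5)
print(f"recovery completed in {time.perf_counter() - start:.0f}s")
subprocess.run(["ceph", "osd", "in", "12"], check=True)    # restore
```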
A
Yeah, it might not even be a hard change, right? I think I just didn't... When use_existing was created so that people could run on existing clusters, it was primarily for running a benchmark on, like, a partner setup. So, say, if Supermicro had some machines in their lab that they wanted to test, to verify that they saw good performance on them.

A
That was kind of the idea: that someone could just deploy stuff and then run CBT against an existing cluster like that. Probably the biggest issue, as I say, is I don't think it's ever been tested, so that's probably the first thing: just try it and see if it works or not.

K
Sure, we can try to run it on our local setups and see if it works, and if not, see if we can get it working with the flag. Yeah.

A
Yeah, and hopefully fairly quickly I should be able to get an example configuration running on Gibba, so that you guys can do testing there too. I'll probably just set up something like 16 client nodes that you can use, if you want, for the workload; that should be sufficient to really saturate the cluster, I think, given the speeds we're talking about here.
A

F

A

A
I don't know, but I didn't want to bother with all that, so I recreated the Gibba full-memory configuration that we kind of set up, on one of our performance nodes, and just ran some CBT benchmarks there, both looking at the settings that we set and also trying the tcmalloc thread cache at 256 megabytes instead of 128 megabytes, and the overall memory usage looked very similar to Gibba.
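For context on the two numbers being compared (128 MiB vs 256 MiB), a sketch of where those knobs usually live follows, with the caveat that the tcmalloc thread-cache environment variable is only read by the OSD at startup (it is typically set in /etc/sysconfig/ceph or the service unit), while osd_memory_target drives the priority-cache autotuning. The numeric values below are illustrative, not the settings used in these tests.

```python
# Sketch: the two memory knobs discussed above.
# TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES must be in the OSD's environment when
# ceph-osd starts; osd_memory_target can be changed at runtime.
import os
import subprocess

# Illustrative only: this would need to be exported in the OSD's service
# environment (e.g. /etc/sysconfig/ceph) before ceph-osd starts.
os.environ["TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES"] = str(256 * 1024 * 1024)

# Cap the OSD's autotuned caches (illustrative 4 GiB target).
subprocess.run(["ceph", "config", "set", "osd", "osd_memory_target",
                str(4 * 1024**3)], check=True)

# Compare accounted mempool usage against the process RSS seen in `top`.
subprocess.run(["ceph", "daemon", "osd.0", "dump_mempools"], check=True)
```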
A
We ended up right around a gigabyte before the thread cache changes, and a little over 925 megabytes, I think, when we reduced the tcmalloc thread cache, and it was fairly stable through 4K random writes, which is usually a fairly decent test to invoke OSD memory growth.

A
There are situations where we can use more, for sure; especially, we found one around the PG log, or sorry, PG splitting. That's a situation that could potentially invoke significant OSD memory growth. But this got us really close to what we were seeing on Gibba, which is great. There's a little more to it that I was able to investigate on OSD startup.
A

A
The RSS memory usage after we told tcmalloc to release memory was around 50 megabytes. So figure at startup that's kind of where we're sitting, in that ballpark of maybe 30 to 50 megabytes of RSS memory usage, somewhere around that. Once we invoked an RBD pre-fill workload, which is essentially 4-megabyte writes, OSD memory immediately shot up to around 450 megabytes and then progressively grew from there as we started filling in stuff, peaking somewhere around 480 to 500 megabytes of memory usage, RSS memory usage specifically.

A
After the pre-fill finished, the test started a 4 KB random write workload, and memory again shot up to about 900 megabytes to a gigabyte, depending on the tcmalloc thread cache setting. So we saw this significant growth as soon as we started doing small I/O, and this is all very much in line with what we've seen in the past, especially in relation to the onode cache in BlueStore. We see that the onode cache especially seems to cause a lot of memory fragmentation, but it's not the whole thing.
A
There's
definitely
other
stuff
in
here
where
we
see
memory
growth
well
beyond
what
we
see
based
on
the
mempool
counters
and
based
on
some
other
things.
It's
it's
pretty
interesting.
So
I
I
went
back
and
tried
to
look
at
using
the
tcml
keep
profiler
after
we
had
kind
of
done.
Some
of
these
different
workloads
and
the
results
are
in
the
ether
pad
there.
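The profiler runs referred to here can be reproduced with the tcmalloc heap profiler built into the OSD. A sketch follows, assuming the usual `ceph tell ... heap` commands and the default dump location under the OSD's log directory; the exact path and the pprof executable name vary by distribution and are assumptions.

```python
# Sketch: capture a tcmalloc heap profile from a running OSD and summarize it
# with pprof. Paths and the pprof executable name are assumptions.
import subprocess

OSD = "osd.0"
subprocess.run(["ceph", "tell", OSD, "heap", "start_profiler"], check=True)
# ... run the workload of interest here ...
subprocess.run(["ceph", "tell", OSD, "heap", "dump"], check=True)
subprocess.run(["ceph", "tell", OSD, "heap", "stop_profiler"], check=True)

# Summarize allocations by call site (heap file name/location assumed).
subprocess.run(["pprof", "--text", "/usr/bin/ceph-osd",
                "/var/log/ceph/osd.0.profile.0001.heap"], check=True)
```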
A
Since
we
don't
have
a
lot
of
time,
I'm
not
going
to
open
up
and
just
walk
through
them,
but
but
feel
free
to
take
a
look
if
you're
interested
there's
some
really
interesting
stuff
there.
I
will
say
that
the
in-use
numbers-
I
don't
trust
that
tc
malloc
is
claiming
or
p-prof.
One
of
the
two
is
claiming
that
we're
using
way
more
memory
or
have
way
more
in-use
memory
than
than
we
do.
So.
I
suspect
that
maybe
it's
missing
some
possibly.
A
I
trust
the
alloc
data
a
lot
more.
That
looks
a
lot
closer
to
what
I
would
expect
to
see
and
it's
very
high.
We
do
a
lot
of
memory,
allocation,
same
amount
of
memory
allocation,
but
that's
very
similar
to
what
I've
seen
in
instrumenting
the
osd
in
other
ways
previously-
and-
and
it's
really
interesting-
I
mean
buffer
list-
is
you
know
clearly
the
the
big
thing
here
right?
We
we
allocate
tons
and
tons
and
tons
of
memory
and
buffers.
So
that's
really
big,
but
but
there
are
some
other
interesting
things
there
too.
A
Finally,
radic
and
I
sat
down
a
while
ago
back
in
2020,
I
think
and
tried
to
create
a
pur
per
thread
ring
buffer
for
specifically
for
the
use
case
where
we're
pending
memory
like
little
appends
for
encoding,
and
I
went
back
in
and
started
testing
that
with
master
again,
and
it
turns
out
that
if
we
limit
the
size
of
memory
allocations
for
the
ring
buffer
to
be
within
like
64k
or
you
know,
with
an
8k,
we
can
actually
get
a
little
bit
of
a
memory
usage
gain.
A
By
doing
that,
even
though
we're
allocating
more
memory
per
thread
for
that
ring,
it
lowers
fragmentation
enough
that
we
see
a
small
win
out
of
it.
It
looks
like
on
the
order
of,
like
you,
know,
8
to
16
megabytes.
Maybe
I
pretty
consistently
saw
in
the
test
the
configurations
I
was
looking
at
that
memory
usage
was
down
a
little
bit
despite
allocating
more
memory
for
this
thing,
so
that
was
really
interesting
to
see,
but
the
pens
aren't
the
big
consumer.
A
We
we
improved
that
a
couple
of
years
ago
by
changing
the
way
that
that
that
works,
so
that
we
don't
just
do
like
lots
of
tiny
little
appends
for
the
pen
hole
use
case.
We
we
grow
it
if
there's
lots
of
little
ones,
kind
of
the
way
the
vector
works
in
sql
plus.
So
that
does
not
appear
to
be
the
the
real
win.
A
Maybe
the
real
win,
I
suspect,
would
be
to
change
the
way
that
the
messenger
works,
to
allocate
a
block
of
memory
and
then
not
move
like
allocate
memory
for
for
buffers
and
then
move
them,
but
rather
to
just
have
pre-allocated
memory
and
then
use
them
and,
like
you
know,
send
a
pointer
around
or
whatever
that's
a
lot
of
work.
A
I
don't
know
if
it's
worth
it,
but
if
you
look
at
those
prof
profiles,
the
the
messenger
allocations
seem
to
be
a
fairly
big
percentage
of
the
overall
allocation
behavior
along
with
other
things,
you'll
see
it
in
there.
So
that's
that's
what
I'm
looking
at
here.
I
don't
know
what
the
right
route
is
to
reduce
our
memory
usage
further
right
now
we're
hitting
kind
of
a
hard
wall
at
900
megabytes
rss
in
the
sub.
A
We
could
probably
make
it
a
little
smaller,
but
fragmentation
is
a
pretty
big
problem
to
lure
further.
I
think
and
that's
what
I
got.
F

F
What kind of behavior do we see, or where do the bottlenecks lie, and how much further do we need to go, kind of thing?

A
I suspect that what we've done is we've lowered the steady-state memory usage, right? Like, we've done all these things that are really clear consumers of memory, and now our steady state looks lower; you know, we got it down to like a gig, right? But if there are things that consume memory that we're not controlling here, we could still see big spikes.
A
Exactly
exactly
and
and
yeah
fragmentation
is
awful,
I
mean
this
is
tcml,
does
about
as
good
of
a
job
as
we've
ever
seen
in
terms
of
controlling
and
and
dealing
with
stuff's
behavior
memory
allocation
behavior,
our
our
penchant
for
for
allocating
little
things
all
over
the
place
and
and
it
even
tc
malik
struggles
with
what
we
do.
A
Certainly
lipsy
malik
deals
with
it
far
far
less
gracefully
than
tc
malek
does.
So.
If
we
want
to
make
this
better,
if
we
really
want
to
make
it
better,
it
probably
means
changing
the
way
that
we
allocate
memory.
A
Gabby,
I
am
going
to
call
you
just
briefly
here.
I
know
you
were
kind
of
interested
in
some
of
this
kind
of
stuff.
Does
this?
Does
this
sound.
P

L

L

L
The OSD would still do a single allocation, update RocksDB once, and then send the data to be written to BlueStore, but bypassing RocksDB and bypassing the allocation. So I know it's very tricky to do; it's really easy to say this, but I know the devil is going to be in the details. That's why I suggested a different approach. It could be done by them, but it's, of course, going to be less useful for us.
A
Yeah
I'll
take
a
look
gabby,
you
know,
even
just
even
if
it
doesn't
fix
any
problems
just
having
somebody
going
through
and
looking
at,
where
we're
allocating
memory
and
where
we're
just
having
a
really
clear
view
of
of
what
kind
of
behavior
we're
requiring
of
tc
malik
and
updating
that
to
make
sure
we
understand
what
that
really.
L

L
Yeah, so I'm still not sure that this thing would be possible to do, or if it's possible to make it into something really functioning, but it might just be able to prove where the money is. So hopefully they will be able to do part of this. I really tried to break it down into many steps, so that even after the first step they have something they could present, because it might be that even the first step would prove too difficult to do. And I also want your opinion on something else.
L

L

M

L

A

L

L
I can't put you on the mailing list there, but it's really something that looks unreasonable. I don't expect... I mean, I understand that if the map is very big, once there are all kinds of failures, disconnects, failover and fallback, the calculation is going to be polynomial.
A

L

B

L

B

G

G

G
It's looked up if the OSD map has a pg_temp or one of the various other overrides we have, but otherwise it's a calculation, and we don't cache it. And the whole point of CRUSH and the OSD map is to not encode the lookup in the map; it's a calculation.
L
Greg,
let
me
just
put
something
into
perspective,
and
actually
this
thing
might
be
a
a
good
way
to
approach
it.
Ibm
is
trying
to
put
a
safe
client,
safe,
rbd
client
on
a
smart
nic,
and
the
smart
nic
has
a
very
what's
the
word,
a
very
powerless,
not.
L
So
one
thing
we
tried
to
suggest
was
the
the
the
solution
that
we
are
trying
to
push.
Also,
that
is
currently
under
development,
assumes
that
every
ost
going
to
have
a
gateway
solution
where,
when
the
io
arrive
to
it,
it's
going
to
do
a
full
verification
and
if
it
belongs
to
it,
it's
going
to
push
it
down.
Otherwise,
it's
going
to
be
forwarded.
R

G
That's fine, but it is not how it works right now. To answer your question, Mark: you know, 10 or 12 years ago it was on the order of 10 microseconds with the maps that we tested it on. I don't know how dependent that was on the map complexity, whether it got worse as they got more complicated, and I don't know how much it scales with CPU frequency.
G

A

G
When someone comes along saying "we're spending some time in CRUSH", yeah, you can try adding caching... and then they don't.

A
Yeah, I couldn't remember if in Greg's, I'm sorry, in Sage's thesis he actually talked about the time complexity or not, but...
G

A

B

G

L
How big a configuration are you talking about?

L

G

L
A configuration... once the configuration grew to 1000 OSDs and 60 nodes, then it became, what was it, 40 milliseconds?

L

G
I mean, the last time I saw it get really long, I think it was as a result of... I don't remember what was triggering it, but we were getting retries on lots of...
L

A

G
The last time it got really long, or that we saw it get really long and did something about it, was as a consequence of "out" OSDs still being in it. So you hit lots of retries and backoffs, and, you know, it goes pretty badly, and I think that might have been partly resolved in one of the straw iterations. I don't remember.
G
It
was,
I
think
it
was
fixed
in
one
of
the
straw
configurations.
At
least
one
of
them
was
actually
about
waiting
being
wrong.
I
don't
know
if
another
one,
I
don't
remember,
yeah
yeah,
I
mean
the
straightforward
answer
to
this
is
yes,
it's
probably
worth
I
mean
the
problem
is
it
depends
on
the
number
of
pgs
that,
like
the
finest
tax
thing,
it's
definitely
worth
like
looking
at
cashing
the
crush
calculations
again,
but
no
one
has
done
that
yet
because
it
has
never
quite
been
worth
it.
L
Okay,
but
now
on
your
side
mark,
are
you
guys
testing
this?
On
the
client
side,
I
mean,
even
if
it's
going
to
be
25,
40
micro
seconds,
it's
a
lot
of
money
when
you're
talking
about
nvme.
A
Ssd
client-side,
rbd
and
cfs
have
probably
been
the
fastest
in
terms
of
single
client
performance
and
on
the
rbd
side,
which
is
probably
what
we've
got
the
most
experience
with.
Looking
at
at
long-term
testing,
we
are
far
more
limited
by
the
implementation
of
the
client-side
rbd
cache
than
we
were
by
crush
crush.
Very
much
was
not
the
majority
consumer
of
of
time
compared
to
other
things.
But
having
said
that,
these
are
on
configurations
that
are
like
between
you
know,
maybe
eight
and
up
to
a
hundred
osds
we've.
L

A

A

L
Maybe we should try to get the numbers, because it might trigger some changes to this. Maybe this gateway design that we are trying to push on the IBM cloud side could actually be general-purpose, or maybe it could be used if you have more than a certain number... or maybe you say, you know what, we could do a flat calculation if you have no more than, I don't know, 256 OSDs.
E

E
If you had, like, a binary tree of hierarchy versus having everything just on one level... because if you have a thousand OSDs on one level, then it will pick from all of them, and there will be a thousand contenders for a specific seed to get chosen for a PG set. If you have that tree, it will just reduce that set.
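A back-of-the-envelope way to see this point, assuming straw2's cost per bucket is roughly one hash draw per item in that bucket: with N OSDs in a single flat bucket every placement draw touches all N items, whereas a hierarchy with fan-out b only touches about b items per level over log_b(N) levels.

```latex
% Hash draws per replica choice (straw2, assuming cost ~ items per bucket)
\text{flat bucket: } C_{\mathrm{flat}} \approx N, \qquad
\text{hierarchy: } C_{\mathrm{tree}} \approx b \cdot \log_b N
% Illustrative: N = 1000 OSDs, fan-out b = 10
% C_flat ~ 1000 draws vs. C_tree ~ 10 * 3 = 30 draws per replica
```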
E

A
That would not surprise me, Adam. That seems like a very good observation.

A

L

A
But I will say that, especially for something like a smart NIC, right, if you're going for a really high... a single client, big...
L

L

A
We don't have any mechanism right now for restricting an object in a pool to only a subset of the OSDs that are there, that I know of anyway. Like, we can't just have, you know, an RBD pool where a given block is represented by a small subset of the OSDs.
L

L

G
It should depend on the number of buckets you have to descend through and how many times you have to back off and retry. I don't think the absolute number of OSDs matters. I guess, you know, given that they tend to be divided into racks and rows and stuff, probably there's something like a logarithmic scaling as you split those up, but in general it's just about the number of times you have to go through a bucket, like a crush bucket, and run a hash.
G

L

L

G
I mean, nobody's going to know offhand. Someone will need to go look at the crush code again, look at the script you ran, and identify whether it matches. If you didn't just run crush: I would just make up a couple of maps of varying complexity and run crush on them in the environment you care about. There are lots of ways to invoke that using the crushtool, and it has testing built in.
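The suggestion above can be done entirely offline. A sketch follows, assuming the standard crushtool --test options shown and that the map was extracted with `ceph osd getcrushmap`; the rule number and replica count are placeholders. Timing many mappings in one invocation averages out the per-call overhead.

```python
# Sketch: time CRUSH mappings for a compiled crush map, outside any cluster.
# Assumes the map was saved with `ceph osd getcrushmap -o crushmap.bin`; the
# rule number and replica count are placeholders.
import subprocess
import time

N_INPUTS = 100_000   # number of placement inputs (x values) to map

start = time.perf_counter()
subprocess.run(
    ["crushtool", "-i", "crushmap.bin", "--test",
     "--rule", "0", "--num-rep", "3",
     "--min-x", "0", "--max-x", str(N_INPUTS - 1),
     "--show-statistics"],
    check=True,
)
elapsed = time.perf_counter() - start
print(f"{elapsed / N_INPUTS * 1e6:.1f} microseconds per mapping (incl. tool overhead)")
```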
G

L

L

G

L

A

L
Should we add you guys to the mailing list there, or won't you be interested? I mean, Mark, you'll probably have no way to escape it.

L

G

A
Gabi, is it possible that they could have a live session, so we can actually do real work here on this, and we'll just talk about it? Yeah, sure, okay. Let's do that, and then let's just profile this thing and see where it's spending time. That will give a good indication of where in the code to start digging in.
L

A
They can do it too. I mean, they should have perf, they should have RPMs for perf, so they can run that directly, and then my thing should be really easy to compile as long as they've got, you know, GCC and the usual build tools on the system, which they probably do, so that should work.
L

A

L

L

A

L

L
So I don't know... actually, I don't know at all how the map would look for thousands of OSDs. How big would it be? Could you fit it inside an L1 cache? Because this testing script is just doing one request after another, so eventually everything will be in cache if it's not too big. L1, I think, is 64K, and L2 is something like 512K, but it's shared between all the cores.

L

A
I suppose the other thing is whether or not the benchmark that is being run is actually representative of what a real client would do in terms of the cache, right?
L

L
Our processors are very bad at doing that, so I suggested a middle way, in which we will compute a request on the NVMe side, on the NVMe queue, before they come to the FPGA. The FPGA would look at them and push just the I/O parameters, using some kind of ring buffer, to the ARM cores. The ARM cores would do the crush calculation and push it back, and once you have it, you can process the I/O from the queue.

L
Now, if we assume that the I/Os are coming at a steady rate and the queueing is good, then you should be able to get close to the number of crush calculations we can do, and the ARM processors would be spending 90 percent of the time just doing crush calculations, plus some maintenance jobs, like, I don't know, if the crush map has to be updated they have to do something. But on the normal flow, that's all they'd be doing, just crush calculations, so the cache is actually going to be...

L
It's going to be reasonable to assume that the cache is going to be all around the crush calculation. But how big is a crush map? And can we maybe make it smaller, by fitting it into the L1 cache or L2 cache, maybe even L3? That might be a huge difference in performance.

L
That would apply to everything.
L

A
The other thing I was thinking is that the perf team at Red Hat had some changes to perf to look at exactly these kinds of issues, and it might be worth reaching out to one of them to also provide guidance on some of this.

A
I don't remember who it was... someone presented some of their work a couple of years ago to the Ceph community.

A
I'll try to look it up and see if I can find out who it was. One of the guys over there is kind of an expert in this area, so it might be worth pulling him in if we can't figure it out ourselves.

A
All right, well, we're way over time. Let's wrap this one up, and... yeah, done.
L

L

L
Four megabytes is a very common size for an RGW client, so, as I said, that's going to be a real use case. But I was convinced by the other Mark that RGW is too complicated for them and they should start with RBD. If anything proves correct, we could then say the same thing should apply to RGW, so don't ask them to understand RGW; but aside from RGW, the rest is okay. And actually, one more thing: there is something, I'm talking about the network, where you don't have to wait...

L
If there's a replica, you don't have to wait on the ack from the replica... and then I realized you don't... it's okay, that's very easy to do.
A
Yeah,
like
yeah,
like
I
said
earlier,
the
the
big
win
for
me
would
be
if,
if
someone
can
really
take
a
look
and
figure
out
how
our
memory
applications
look
just
in
the
code,
that
would
be
that'd
be
huge
if
they
can
verify
where
we're
allocating,
where
we're
freeing
and
how
much
we're
doing
in
different
areas.
Some
of
that
that
prof
output
that
I
provided
in
the
ether
pad
as
kind
of
start
in
that
direction,
but
but
having
someone
really
run
with
it
would
be,
would
be
great.