From YouTube: Ceph Performance Meeting 2022-01-27
A
Okay, so a little bit of a quiet week compared to previous weeks. The Quincy freeze has happened; I think people are still trying to get some things in, but there's a little bit less movement for PRs on the performance front. Ronen has a PR related to scrub that he emailed me about earlier this week, basically just changing some of the chunking that happens for this. If I remember correctly, we don't get the chunk size... oh, good, you're here, I didn't see you, sorry.
B
Go ahead, you talk. Okay, the basic idea is simple: there is a configuration parameter (actually two, let's say) for the size of the chunks we are using when scrubbing. For each chunk, we are requesting the maps, the scrub maps, for that chunk from the secondaries, from the replicas.
B
Now the chunk size is currently the default, 25, which is very small when considering a PG that might include millions of objects; we have an example of a million and a half. What I suggested is that we should separate the chunk sizes between deep scrubs and regular shallow scrubs, and allow larger chunks for shallow scrubs, since we're assuming that a regular scrub has less effect on the amount of I/O, or the effort invested.
B
I think the idea is accepted, apart from the fact that we need to make sure that we do not create a problem, a latency issue, for regular client requests. And I would remind everyone that, up to a point, a client request preempts a running scrub, up to, I think, five times; it's a configuration point.
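(For reference, a rough sketch of the knobs in question. The shallow/deep split name comes from the PR under discussion, so treat it as tentative; the other two are existing options with these defaults:)

    # ceph.conf sketch, or `ceph config set osd ...`
    osd_scrub_chunk_max = 25            # today's default chunk size, shared by both scrub types
    osd_shallow_scrub_chunk_max = 100   # proposed: larger chunks for the cheaper shallow scrub
    osd_scrub_max_preemptions = 5       # a client op may preempt a running scrub this many times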
A
And I know that that PR for adding scrub testing to CBT is pretty big, but did you look it over? Does it look like it would be helpful for you for testing?
A
I was just going to offer help too. I think you can probably help him more than I can at this point, since you know that code much better than I do. But maybe, as part of this, we can get that merged, which is very much my fault, I'm sorry; but it sounds like it's a very useful PR, so maybe this is a good test case to actually do that.
C
Yeah, I had actually opened it in draft state because I had been making a lot of changes to it, but I also wanted to share it with Sridhar to show him the way I was testing scrub. But I think now I can raise an actual PR, because that's what we've been using for QoS testing, and it does what Ronen wants to do, or run scrub with client I/O, and maybe we can see what stats he needs and what we can do about that.
D
Go ahead. Sorry, no, I just said the same: it's great!

A
Okay, great, wonderful, all right! Let's see; well, moving on, then.
A
I
saw
was
this:
is
setting
tracing
to
be
in
compiled
by
default?
I
don't
think
deepika
is
here.
The
the
gist
of
it
is
that
there
is
a
little
bit
of
performance
overhead
by
doing
this,
it
sounds
like,
but
but
not
too
bad,
and
the
benefit
would
be
that
we
could
have
users
and
and
customers
more
easily
be
able
to
set
tracing
to
be
enabled
when
they
run
into
a
problem.
A
And of course there is overhead when the tracing is enabled, but with it simply being compiled in, it sounds like the impact is quite low; if I remember correctly from the PR, maybe it's around one percent or less. So that potentially could be worth doing, and I think Deepika reviewed it. I don't know if that's set now to be enabled, if we're actually planning to merge it or not, I guess, but anyway, that's there.
E
So I've been following this one. Originally, when they were doing benchmarks, they were seeing a big performance hit on the OSD, because there are a lot of shared pointers that come with each of these spans for these traces, and this PR is basically removing a ton of spans from the OSD.
A
Would you say a lot less detail? Like, how much are you talking, Casey?
E
I don't know exactly how many sub-spans there were initially; it might have been like 20, or a dozen maybe, and now I think there's just one or two.
G
Okay, yeah, I think in the short term it's okay if you're getting some of those, but we'll probably want those back in the longer term. I guess there are just kind of two audiences that this is targeting, and I think Yuval's mainly been focusing on traces that make sense from the user's perspective; but from a developer's perspective, for being able to profile things and pinpoint where things are going wrong, we're definitely going to want more than two spans in the OSD. So we'll figure that out in the future.
G
But for the short term, I think it's okay if we reduce that just to get tracing into users' hands, because today it's not even compiled in by default.
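(For context, a sketch of what that looks like today: compiling the tracing support in is a cmake switch, and the runtime toggle is a config option; flag names as I recall them, so verify against the tree:)

    ./do_cmake.sh -DWITH_JAEGER=ON                      # build with tracing compiled in
    ceph config set global jaeger_tracing_enable true   # turn the tracing itself on at runtime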
A
Okay. And you think we could do that? We could have variable spans without introducing any additional compile-time overhead, or not? Sorry, one at a time.
G
Yuval's been discussing with the upstream libraries about how to get rid of the shared-pointer piece. I think that if that's eliminated in the interface, it wouldn't matter how many spans we had at compile time or runtime; but we need to add the ability to configure those at the Ceph layer, to turn them on and off, if we want them to have different views for different use cases.
A
Well, neat. I mean, it's exciting, right, because, you know, having good tracing and such would be amazing, and we've talked about it for years, so this is... yeah, yeah.
A
Cool, all right. Well, sounds like we're making progress on it, at the very least. That's really good!
A
So, nothing closed this week that I saw; please let me know if I missed it, but I didn't see anything performance-related anyway. Updated, though: it looks like this PR, Ronen, that you reviewed, around using thread-local pointer variables to save the shard, that's now in retesting, so I think you had approved that previously. Any other comments on that one?
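(A minimal sketch of the thread-local caching idea with illustrative names, not the actual PR code: each worker thread caches the shard pointer it resolved last, so the hot path skips the repeated lookup.)

    struct Shard { int id; };           // stand-in for per-shard OSD state
    static Shard shards[8];             // stand-in for the sharded structure

    static Shard* lookup_shard(int shard_id) {   // the existing, slower path
      return &shards[shard_id % 8];
    }

    static thread_local Shard* cached_shard = nullptr;

    Shard* get_shard(int shard_id) {
      if (cached_shard == nullptr)
        cached_shard = lookup_shard(shard_id);   // lookup paid once per thread
      return cached_shard;
    }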
A
All right, let's see: Radek's PR for introducing huge-page-based read buffers. I don't think Radek's here. I think it's merged.
A
All right, next. This actually probably should be "no movement": that first pass at the omap bench test, Neha, I apologize, is still just sitting on the back burner, but...
A
Yeah, having said that, it's actually something that would have been really nice right now for Quincy testing. I'm really tempted to actually go through and run it anyway, like just apply the PR and then run it for these tests. Yeah.
A
Yeah, I still, you know... if we make it generic for existing OSDs, it's just going to have to look really, really different than it looks now, whereas we could just separate this off into, like, a separate gtest tool or something like that that looks very similar to what we've got right now, and then just, you know, have it there. We could even pretty easily backport it all the way to Nautilus.
H
Yeah, sorry. Well, so yeah, it looks like it's ready for review, so I'm calling for reviewers. And, well, I did another run, which looks good to me; just a few tests that dropped, which I'm double-checking.
A
All right, very good, very good. That was it; that was all I saw for this week. I know everyone's working hard on Quincy, so that's completely understandable, but we haven't seen some of these other ones. Anything I missed from anybody?
A
All right, if not, then the discussion topic I have for this week: I put it in the etherpad, so, you know, feel free to take a look here, but the gist of it is that we are trying to do some testing for Quincy on our...
A
We have a fairly decent number of AMD Rome nodes in-house now, we have ten of them, so we're using those for this release, for going back and doing some comparison tests to previous releases, and we're seeing some different behavior than we've seen in the past with our Intel nodes. For reference, I put a link to some of our previous tests on our older Intel nodes, where we were seeing really consistent improvement going from Nautilus to Octopus to, at that time, master, which was, you know, kind of the movement toward Pacific.
A
This is a little complicated because we're also looking at one versus two OSDs per device on these tests, which was requested at the time; but nevertheless, we're seeing general improvement there for the most part. On some of the new testing that I've just been doing on our AMD nodes, our newer AMD nodes, sometimes we're seeing improvement, and in some cases we're seeing what appears to be fairly significant regressions going to Octopus and Pacific.
A
I was saying in the core meeting that, depending on how you look at this, this is good news or bad news. We've seen some reports on the mailing list of people saying that Pacific was slower for them than Nautilus, and we were not able to reproduce it in-house on our Intel machines; and in fact, you know, this has been kind of the case both in our own analysis on our test cluster, along with tests that have been done by the DFG workload team and other folks.
A
So maybe the good news here is that maybe this is now allowing us to reproduce some of this, so we can go back and figure out what it was that did this. It's sidetracking us a little bit from the Quincy testing, which is what the real purpose of all of this was; but, you know, it's never too late to go back and figure out what you might have done wrong. So, you know, hopefully this will help us understand...
A
...if maybe there are some significant differences between these two different test platforms that we now have access to. Probably before I dig back into this, I'll run some initial tests on the Quincy freeze, just to see how we're comparing. I suspect that some of these tests are going to look better due to Gabi's work on reducing the amount of data that we store in RocksDB.
A
That seemed to be a pretty big win in the write path, especially for small random writes, and that's one of those cases where we did see some regression, specifically going from Nautilus to Octopus on this platform, so we'll see what happens there. But in any event, I wanted to share this with folks.
A
So they can see, you know, kind of what I'm seeing on the ground right now. It's possible that this could change over the course of the next week or two as I try to dig into what's going on, but that's kind of what I'm seeing. Any questions on any of this?
A
One thing I want to mention: I don't think this is due to BlueFS buffered I/O. It looks like we backported those changes to all the different releases that I've tested here, so I believe we're using buffered I/O in all cases, not direct I/O.
A
All right. Well, then, Alex, would you like to take over and talk about what you're seeing with TTL and RocksDB?
I
What we've noticed over the last few months is that we see the latency steadily increase, and no matter how much IOPS we have in the back end, it happens, right? We have some clusters with SSD index, some with NVMe index, and it still happens. And the problem was also exacerbated while we were trying to delete some very old and extremely large shards. They are so large, in fact, that we cannot use the radosgw-admin command for them, because they would impact the whole cluster.
I
So what we've been doing is deleting N keys at a time every X seconds, and as we're doing that, we're also seeing a latency increase.
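(A sketch of that throttled deletion, with placeholder pool/object names and batch sizes:)

    # remove N omap keys from the oversized index object, pause X seconds, repeat
    rados -p "$POOL" listomapkeys "$OBJ" | head -n "$N" | while read -r key; do
        rados -p "$POOL" rmomapkey "$OBJ" "$key"
    done
    sleep "$X"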
I
So what has been happening in our production environment is that at some point in time, one index OSD starts to pile up slow requests, and that blocks our entire cluster after a while. So we've been looking at logs for it, and we kept seeing slowness in the omap iterator, taking, like, over 10 seconds to list a few keys. And so we started to look at that, and found out that it was extremely likely caused by tombstones.
I
It's such a problem that even on clusters where we don't have this large shard deletion, we have to compact three times a day, and even then, as customer workloads start to ramp up, the compaction...
I
...times out; like, we hit the OSD command-thread timeout, or suicide timeout, or something like that, which causes further issues. So over time it gets worse and worse and worse. So we looked at RocksDB and found these two options: one that is available starting with Pacific, periodic_compaction_seconds, which forces compaction regularly; but because we are still on Nautilus in production...
I
...we started to look instead at TTL, which will, also periodically, look for the tombstones and trigger compaction to remove them. So I've run some benchmarks on that. Let me...
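(For reference, both options can be fed through BlueStore's RocksDB option string; a sketch with illustrative values, appended to whatever options are already set:)

    bluestore_rocksdb_options = <existing options>,ttl=21600                            # 6 hours
    # or, where the bundled RocksDB supports it (Pacific defaults this one to 30 days):
    bluestore_rocksdb_options = <existing options>,periodic_compaction_seconds=2592000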
I
Sorry, here. Oh, so before the benchmark: I think I've seen this in the mailing list as well, many times, people having this issue with the index, and the recommendation so far has always been "oh, compact regularly". But we also found out that on our older cluster that used SSDs, a live compaction of an OSD would just bring the OSD down.
I
So that's why we started to look at different options, and TTL seemed to be beneficial. So I ran some benchmarks; they are very specific to our environment, and that's why I want to have a discussion and see if people would find this valuable, or if we need to run something else for you guys. So in the benchmark, we delete ten thousand keys at a time, at up to ten thousand keys per second, and you can see, with the default options of Ceph...
I
...you can see that latency increases steadily over time, and then there is a compaction event, I assume, and it goes down, but stays far from the baseline where it started, and then goes back up; and you kind of have a trend where the latency slowly increases over time.
I
Whereas if you use TTL, latency still steadily increases, but after a while it just drops back down to the nominal value that you had at the beginning of the benchmark. As for other tests that are more... it's still not very production-like, but basically what I'm trying to do with this test is just generate the worst-case scenario possible, just to amplify the issue as much as possible, to better understand it.
I
So when the TTL compaction starts to run, you see this huge spike in latency, and then, when it's done, everything goes back down quite nicely. And I think the spike is just due to the concurrency of it all, because in the benchmark I'm basically running a single pool, a single writer, one object with a large omap; so yeah, pretty much the worst-case scenario. And so it's been beneficial for us on the index. I have a colleague, Josh, Josh Baergen; they're deployed in production, we can talk about that as well.
I
It's been very, yeah, very beneficial. We set it at, I think, six hours. I don't think it's going to be that beneficial for non-RGW workloads to set it at, like, six hours, but I did notice that Pacific by default, well, the RocksDB version in Pacific by default, has it enabled at 30 days. So I don't know if that helps with pg log and all of that. Yeah, I mean, Josh, if you're here, do you want to talk about the production environments?
J
Yeah, you can hear me, right? Great, okay. Yeah, so we've been trying this in production; it's helped quite a bit. We actually did have some clusters that were NVMe-index-based where we were running compaction regularly. They were stable enough to do compaction, but their compactions were starting to go up to, like, the 10-15 minute mark for a run, and were steadily increasing over time. So that just was not a sustainable approach.
J
We actually tried TTL at 30 minutes to start with, and that worked fine, but the biggest downside we've had from TTL so far is the write amp, which is not surprising; that's basically what the RocksDB docs warn about, that your write amp goes up, and with, like, a 30-minute TTL...
J
...our write amp was such that it would probably torch an SSD in about three years, just due to drive writes per day. So dropping that down from 30 minutes to six hours reduced the write amp by about a factor of five, so there's tons of headroom there at that point, and it's actually been really stable, too, on our SSDs.
J
On some of our bigger indexes, it probably took about two hours... no, I want to say more like three or four hours for TTL to actually clean up the database of all the stuff that was accumulating over the years that it's been running; and when it did that, there are several cases where we've seen latencies go down for index operations. But I guess the more important thing is, and this is something we're going to be learning soon:
J
...some of our big deletion operations, cleanup operations: are they going to be better? We literally rolled this out to our worst-case cluster yesterday, or finished yesterday, so we just don't have that data yet. But in terms of stability, it's been looking really good, and it's outright replaced our compaction jobs.
A
Do you happen to know, compared to the default TTL, what kind of write-amp increase you saw with, like, six hours?
A
Or, for whatever you've measured: what did changing that do to your write amp?
J
Like, it absolutely did. So I actually don't have the number for, like, no TTL versus TTL, what the write amp looks like, but it definitely increases significantly. Basically, the drives were sitting at 0.01 drive writes per day; and so, if you imagine dropping down from 0.01 drive writes per day to whatever is torching a drive in three years, and doing the math in my head right now, it was probably well over a 10x.

A
Wow, okay, okay.

J
But the thing is, our, like...
J
The issue is not disk access; the issue was entirely the CPU usage incurred by accumulating tombstones over time.
J
Yeah, and what I see in the graphs is, like, we'll get bursts of compaction now with TTL. I set it at six hours, so obviously you're getting a burst at least every six hours; if you've got a steady workload coming in, it's maybe a little bit more often than that, but it'll run for maybe 10 minutes, 20 minutes of compaction every once in a while. Which was the other thing I was going to say, right, yeah: the other thing is, I don't think this is good enough.
J
I've really been wanting to turn off BlueFS buffered I/O when I can; the buffered I/O actually really hurts us. We've been turning it off in as many clusters as we can, because, I think I've mentioned this at this meeting before, but once you have a dm-crypt layer in there, it actually causes, like, a huge IOPS amplification for writes, and that really hurts us, especially for our SSD-based systems, where they just can't take that level of IOPS.
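(The switch in question, for reference; whether it helps is workload-dependent, per the read-ahead caveat that follows:)

    ceph config set osd bluefs_buffered_io false   # have BlueFS use direct I/O instead of the page cache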
A
We really need to figure out why RocksDB is not properly doing read-ahead and, like, reading from its cache when it should be, and is instead relying on the page cache for it; it's ridiculous. But if we could fix that, we could just get rid of buffered I/O, I think, and go back to doing direct I/O.
J
Yeah, and it's something that I would love for us to be able to spend some time on too. But until we're at Pacific (we're still on Nautilus), until we're at Pacific, I don't really want to be spending that much time on it yet.
A
Well, hey, this is fantastic. This is really, really good. This is the kind of stuff that we don't see as often, right, because we don't have the kinds of clusters that you guys do. So this is excellent. I think the trick will be figuring out what the right balance by default is: you know, doing regular compactions like this, versus write amp, versus, you know...
A
...you know, what makes sense for people. But certainly more so than what we have right now; it sounds like it's very reasonable, yeah.
J
And it's so workload-dependent, right? I mean, like Alex said, we do have workloads that are just so delete-and-replace-heavy that this makes a big difference. We saw disk usage drop by 10 to 20 percent when we turned this setting on, so that tells you how much redundant data, or undeleted data, there is sitting in our databases. Not everybody's going to have a workload like that: if your workload is write-once, read-many, you're not going to benefit from TTL at all.
J
Whereas for them, their list performance just gets brutal, because they get into this cycle; and the thing is, RocksDB is only compacting away tombstones when it has to, when a level is full or whatever, right? And if they're not deleting enough, maybe the files that hold their tombstones aren't even getting attention; it could be other files that are actually the ones being compacted. So something like this at least gives us a guarantee:
J
...no tombstone is going to last longer than six hours. Six hours is probably too aggressive for standard workloads, because, like, if you don't actually see symptoms for a few days, you could run with, like, a one-day TTL, or even a three-day TTL, or whatever, right? It doesn't have to be six hours. It's just that, for some of the things that we want to do on these indexes, six hours is what makes sense for us. I'd love to do it at 30 minutes, but, like I said, the write amp is way too high.
A
Yeah, when we were seeing really, really bad behavior with the, like, bulk delete stuff a while ago, I think, Igor, we had... we talked about even, like, trying to do compaction on, like, an iteration basis, right? Like, you iterate over stuff, and maybe...
A
...or delete stuff, right? Maybe after some number of ops you end up going back and doing compaction, rather than basing it on the amount of data that you have waiting to compact. But I don't know if that makes more sense than just doing, like, a TTL-type thing.
J
You know, yeah. One of the things that Alex, and then another one of our colleagues, Matt Vandermeulen, looked into a little bit is that there are RocksDB calls for saying, like, compact over a range; or, if you do a deletion... I can't remember how it works. It's something like: if you do a deletion over a range, you can tell it to, like, compact all the way down through all the levels, or something like that.
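(The call being referred to here is presumably RocksDB's CompactRange; roughly, a sketch:)

    #include <rocksdb/db.h>
    #include <rocksdb/options.h>

    // Compact just the deleted key range, forcing the bottommost level so the
    // range's tombstones are actually dropped rather than left behind.
    void compact_deleted_range(rocksdb::DB* db,
                               const rocksdb::Slice& begin,
                               const rocksdb::Slice& end) {
      rocksdb::CompactRangeOptions opts;
      opts.bottommost_level_compaction =
          rocksdb::BottommostLevelCompaction::kForce;
      db->CompactRange(opts, &begin, &end);
    }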
J
It didn't sound like as much of a guarantee as you'd think it might be, and the issue is that your client has to somehow give that hint, and that's not always possible, because it's not like RGW is reaching into the internals of RocksDB on the index OSDs, right? And so I think it gets really complicated to try to do stuff like that.
J
So, I don't know if periodic compaction is actually just running a filter and then deciding whether or not it should compact a file; and if that's true, maybe it has less write amp. Like, I have a hard time, from the documentation, predicting what RocksDB settings are going to do, so we also need to try it and see what happens.
A
You guys might frankly be more expert on this than anyone; I mean, like, Adam and I have tried to kind of look into some of it, but not a ton, and already you're teasing out details here that, I think, you know, we don't necessarily have any more expertise on. So, you know, absolutely come and report what you find, because this is really good.
L
One point: we worked with a startup called Speedb, which currently provides a RocksDB drop-in replacement; they promise to open-source it, yeah. And during our work, they showed tremendous improvement on a RocksDB benchmark. But when we put it into Ceph, we saw that it didn't help us, because RocksDB wasn't the bottleneck in what we were doing; but they had, like, a 10x performance improvement in RocksDB.
L
Doing just RocksDB benchmarks, it could be that this is a case where such a thing could help; because, you know, when RocksDB becomes the bottleneck... we just couldn't find a real test case. We tried to find a real use case where RocksDB was the bottleneck and their improvement actually made a difference.
A
I talked to those guys just a little bit; Adam did most of the work talking to them and working with them. I don't remember hearing, when I talked to them, that this was something that they had tackled, you know, as a big performance bottleneck compared to typical RocksDB: specifically looking at the compaction behavior and the performance of tombstones in cases where you have a lot of accumulated junk.
L
I talked to them a lot, but I don't... yeah, I don't know whether, when I talked to them, we had coordinated the work on this; I don't have a clue whether they tackled this problem. I know that they claim that they reduce their write amplification significantly. So if one of the problems here is write amplification, because of what we are doing, they claim that they have a significant improvement in this, again.
A
Our write amplification already... that's kind of the whole reason why we've got these giant memtables and giant write-ahead-log buffers, right? You know, there's always this desire to make them smaller, which I completely understand and agree with, except for the fact that, because of the way that pg log works, we end up with, you know, all of this temporary data that gets moved into level zero and moved into the database...
A
...if you have very small buffers. So we've kind of, like, you know, gone about as extreme as we can in this direction of trying to reduce write amplification, and now it's kind of the question of, well, how much do we bring back to make the behavior nice? If they can showcase that they can introduce all of these nice behaviors that we want while keeping the write amplification low, you know, that is valuable. But that's the big question.
L
If I'm correct (it was some time ago, and I don't remember all the details), I think that what they actually did was implement the compaction differently, in a more efficient way. So, just from the sound of what's going on, because I didn't fully understand the exact problem that Josh and Alex explained, but just from the sound of it, it seems like it touches the same points. But, you know, we tried it in the past.
L
I'm not sure that it works with the version that you'd see in Nautilus, so it could be that. But maybe they will do it for us, because they're in a good relationship with us; maybe, if we give them a version, they will build that version for us. But if you think it's worth testing or checking this, I could start talking with them and see what we can do.

A
Do you know what their timeline is for open-sourcing it?
L
No, I... and I can check. I can check it, but I don't know. They had several meetings with us, us as Red Hat, on what economic models they could use, what the benefit is, how they'd do it, all these kinds of things. They have some grand plan to do something much larger than this RocksDB improvement, and they have a lot of incentive to open-source it, but I'm not updated, at least for the last three months, so I'll need to check, sure.
A
And Adam might... I think he's on PTO this week, but I think he's back next week, so I'm sure he'll have lots of feedback to give on what he saw as well. So maybe we should wait until he's back, so we can get his feedback.
L
Okay, maybe you could raise this, you know, put it on the agenda for, you know, two weeks from now, and we'll try to figure out what's going on. Yeah, yeah.
A
We usually... I do this kind of week by week. Would you just send me a reminder, if you wanted it for the week after this week? Okay, cool, cool.
I
So basically you're asking about running it where it runs but there is no tombstone to compact? Yeah, yeah, that's a good question; we haven't thought of observing that.
J
Yeah, we've been entirely focused on our index use case, so we haven't looked at, like, what happens in an RBD cluster if we implement this, for example; and I don't think we would, as the write amp would be unacceptable there for drive life and that sort of thing, we think. But yeah, no, it's a good question: we don't know what happens with a, quote-unquote, normal workload, where there's low tombstoning or no tombstoning and just writes of unique data.
J
Yes, like, my very rough understanding from the docs, and again, I don't know how much I trust the docs versus, like, digging into the code, but basically, when a file gets written in a level that's not the bottom level, it gets a timestamp, and then the TTL is just, like, a periodic check of that timestamp.
J
So if you compact anyway, if the file gets shoved down a level or something like that, then it's likely that timestamp would change, and it's not going to get TTLed. Now, like I said, almost everything seems to end up in the bottom level. So what I don't know, like, the big question in my head, is: is the timestamp when that file came into existence, or is it a timestamp of when that file was last updated in some way? Right, that's a fairly critical difference.
J
So we were concerned that, because we're going to go and turn this on on an OSD that's been running for three years, hours of TTL compaction would basically starve level-size-based compaction. But it doesn't: if you look at the loop, it basically says, do I need to compact for this reason, for this reason; and at the very bottom it's TTL: I've got no other work to do, okay, I'm going to schedule some background compaction.
A
Yeah, keep us in the loop on this. This is super, super good.
J
Yeah, for sure. We'll be gaining more experience with this in the coming weeks as we start to re-enable some of our more delete-heavy workloads and see how it actually does in the wild. Cool.
A
All right, well, we've got about five minutes before we're at the hour. Anything else from anyone this week?
A
All right, well then, have an excellent week, everyone. Thank you so much for coming, and we'll talk again next week. Bye, guys, bye.