From YouTube: 2016-DEC-14 :: Ceph Performance Weekly
Description
Weekly collaboration call of all community members working on Ceph performance.
For full notes and video recording archive visit:
http://pad.ceph.com/p/performance_weekly
B: I think a lot of stuff is actually on hold until Kraken is released. A lot of the more interesting stuff here is kind of waiting on that, so I guess once Kraken is released, hopefully we'll see some movement on this. But for today, at least, there's a fair amount of stuff here that's just kind of waiting.
B: Alright, let's see. Okay, it looks like there are a couple of different topics that have been added here for discussion, but I'll just quickly go through the ones I've got, and then other people can probably talk for more of the time here. So over the past week we ran RBD on EC overwrite tests, both on spinning disks and NVMe, looking at both FileStore and BlueStore. The general gist of it is this.
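As context for reproducing the kind of test being described: a minimal sketch of how an RBD image backed by an EC data pool is set up, using the allow_ec_overwrites pool flag as it later shipped in Luminous (EC overwrites were still experimental in the Kraken timeframe; the profile, pool names, and fio parameters here are illustrative, not the exact ones used in these runs):

```
# EC data pool with overwrites enabled (Luminous-era syntax)
ceph osd erasure-code-profile set ec42 k=4 m=2
ceph osd pool create ecdata 64 64 erasure ec42
ceph osd pool set ecdata allow_ec_overwrites true

# RBD metadata lives in a replicated pool; data goes to the EC pool
ceph osd pool create rbd 64 64
rbd create --size 100G --data-pool ecdata rbd/bench0

# Large sequential writes vs. small random writes: the two cases
# discussed below
fio --ioengine=rbd --pool=rbd --rbdname=bench0 --direct=1 \
    --rw=write --bs=4M --iodepth=16 --name=seqwrite
fio --ioengine=rbd --pool=rbd --rbdname=bench0 --direct=1 \
    --rw=randwrite --bs=4K --iodepth=16 --name=randwrite
```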
B: The links are in the etherpad here; if you want to actually look at the graphs you have to kind of scroll over to the right. But the overall gist of it is that BlueStore actually does really, really well with RBD on EC overwrites for large sequential writes, which is kind of the case you'd expect it to do best in, where you've just got big streaming writes happening. It's actually quite fast: on NVMe I think it was about twice as fast as 3x replication, and for hard drives
B: I think it was about fifty percent faster. So that's the good news; that's a good common use case for EC anyway, and we're seeing some of the same kinds of benefits you'd expect in those cases. We also see a lot of the same kinds of problems, or actually maybe new problems: sequential reads are slower, and that makes sense given the way this stuff works.
B: Small sequential and small random writes are slower, and that also makes sense given how this all works: you're doing a lot of extra work. So BlueStore is slower, but maybe tolerably slower, and I'm hoping we can make it go faster than it is right now. But it's definitely slower than replication. FileStore is pretty atrocious, to be honest; it's dramatically slower for small random writes versus replication. So.
C: I want to take a second to put this in context. I think the only number that we should pay much attention to right now is the case where we're overwriting a full stripe, because that is roughly what it's going to look like going forward. In the BlueStore case, when you write a full stripe it does a clone range on all the previously overwritten data, so it's generating a little bit more metadata than we ultimately want it to, but it probably isn't a dramatic difference.
C: FileStore is never, ever going to look any good for EC overwrites, because it's literally reading data, writing it somewhere else, and then overwriting it. It's just not designed to be able to handle this, so I think we can ignore that. The plan between now and Luminous is to make the case where we're doing small writes that are smaller than an entire stripe more efficient, by hopefully updating only some of the shards in the stripe, and then making peering clever enough to handle the case where some of the shards saw some updates and some of them saw other updates and the logs get sort of merged. There's...
C: Well, and hopefully make BlueStore behave well; possibly have BlueStore do the rollback thing so that its metadata management for that clone case is more efficient. Other than that I'm not sure; I'm not sure that's going to be the top priority either. Okay, so in that case we really should be looking at the full-stripe write case. Yeah, I don't know, I mean.
E: Sorry, I mean, they probably are at the defaults at the moment. Mark, what did you run them with?
B: I just left them at default, so whatever the default currently is.
C: Yeah, I think the only question this raises for me is: you're showing that the sequential writes were really fast, which is to be expected because we're writing less data, but it shouldn't be any different than the old EC behavior, like if you're doing full-stripe writes with the append-only EC pool or whatever. Yeah, even so, I guess you can't really do that comparison quite right, but no.
B: Yeah, I think so. I mean, these results seem very reasonable to me in terms of what I would expect to see. We're doing really well on sequential writes; we're doing kind of crappy on sequential reads, which makes sense given the architecture; and we're not doing well with 4K random writes with FileStore, which, like I'm saying, sounds very correct too. So I don't see anything here that seems off. Is there anything here that you guys think is unexpected?
A: I'm uploading the results as we speak. I've been looking at this as well, Mark, and what I found is that sequential reads at lower queue depths with EC actually show slightly better performance than replicated. I think the difference is that if you're sort of splitting the reads across the chunk sizes / stripes with EC, rather than doing 4 MB reads in a replica pool, it's slightly more efficient; but if you're saturating the disks it would sort of get worse.
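A rough way to see why chunked EC reads can win at low queue depth and lose at saturation (illustrative arithmetic, assuming a k=4 stripe with 1 MiB chunks versus a single 4 MiB replica-pool read, a seek cost $t_s$ per disk touched, and per-disk bandwidth $b$):

$$T_{\text{replica}} \approx t_s + \frac{4\,\text{MiB}}{b}, \qquad T_{\text{EC}} \approx t_s + \frac{1\,\text{MiB}}{b} \quad \text{(four chunks read in parallel)}$$

A single EC read therefore completes sooner, but it consumes $4t_s + 4\,\text{MiB}/b$ of total disk time rather than $t_s + 4\,\text{MiB}/b$, so once the disks are saturated the extra seeks invert the advantage.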
B: We are looking into that. But if anyone sees any regressions in recent master with FileStore and replication, specifically with partial overwrites or potentially random writes of larger sizes, it's quite likely that it's this, so we are tracking that down now. I guess the only other thing I'll mention is that we have started doing some BlueStore at-scale testing at Red Hat; it just started yesterday, so we're still getting results and figuring out how to tune.
C: Yep. So after going through the op latency breakdown from last week, I started to focus on just that first handoff, where the fast dispatch thread puts things into the queue, and looking at all the things that we're doing there that we could hopefully not do; and there's a bunch of low-hanging fruit there. That was the first one: there's the pull request that you linked at the top; getting the OSDMap reserved there is just, like, banging on a mutex.
C: It's doing a memory allocation in an STL map entry that usually gets removed and then re-added on every single op; just kind of stupid stuff, so that pull request tries to address that. The next thing is that it allocates the... right, before I forget: it allocates the OpRequest, which is fine, but then I noticed that the OpRequest is all using shared pointers, and so what I actually wanted to change is the queue that it pushes it onto.
C: First it pushes it onto the session's queue, and then it immediately goes and processes everything in that queue and usually pops it right back off. So I wanted to make that an intrusive list, so that there's no additional memory allocation there. But in order to do that we need to change the TrackedOp refs to be intrusive pointers instead of shared pointers, and I'm, like, two-thirds of the way through just fixing that; there are some annoying issues in the MDS.
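A minimal sketch of the shape of that change, using Boost.Intrusive (the type and member names are illustrative, not the actual Ceph ones): both the refcount and the queue link live inside the op itself, so neither taking a reference nor queueing the op touches the allocator.

```cpp
#include <atomic>
#include <boost/intrusive/list.hpp>
#include <boost/intrusive_ptr.hpp>

namespace bi = boost::intrusive;

struct TrackedOp {
  std::atomic<int> nref{0};
  bi::list_member_hook<> queue_hook;  // embedded link for the session queue
  // ... op state ...
};

// intrusive_ptr refcounting: no separately allocated control block,
// unlike std::shared_ptr
inline void intrusive_ptr_add_ref(TrackedOp *op) { ++op->nref; }
inline void intrusive_ptr_release(TrackedOp *op) {
  if (--op->nref == 0) delete op;
}
using TrackedOpRef = boost::intrusive_ptr<TrackedOp>;

// The per-session queue links ops through their embedded hook, so
// push_back()/pop_front() never allocate.
using SessionOpQueue = bi::list<
    TrackedOp,
    bi::member_hook<TrackedOp, bi::list_member_hook<>,
                    &TrackedOp::queue_hook>>;
```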
C: But what I'm hoping is that once that's done we can make those changes and then rerun that same latency test and actually see a measurable difference in that handoff, in the amount of time we're actually spending in fast dispatch. Maybe that's optimistic; I don't have a good sense of the order of magnitude of the amount of time that those operations are taking or wasting, but it seemed like a good place to start, because all this stuff is done in the OSD.
C: Yeah, I mean, I'm not sure that that's where the time is actually being spent, but it can be pretty hard to tell. It's one of the obvious things that's happening in that path, I guess. So, how...
C: We're going by the... I was just looking at this path because, in the total latency breakdown down to the shards, it was like 50 microseconds; roughly that amount of time in total is spent in fast dispatch. Alright.
C: Out of a total, for the whole op, of something like 500; I don't know. Yeah, right, yep. But it's sort of a narrow thing that, in theory, should be super fast: we're not doing any real work there. The only thing it really should have to do is allocate the structure, put it on the queue that it belongs on, and then go away.
C: Yeah, so the second thing I wanted to talk about really quick: there are two open pull requests with the trace requests, or the trace points. There are two pieces. There's a really simple one that just lets you put a macro in a function that will then add entry/exit instrumentation, LTTng tracepoints, for that function.
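A sketch of what such an entry/exit macro looks like with LTTng-UST (the provider and event names are invented for illustration; the macro in the actual pull request is not necessarily spelled this way):

```cpp
// trace_funcs.h -- LTTng-UST provider for function entry/exit events.
// One .cc file in the build must #define TRACEPOINT_DEFINE before
// including this header.
#undef TRACEPOINT_PROVIDER
#define TRACEPOINT_PROVIDER osd_func
#undef TRACEPOINT_INCLUDE
#define TRACEPOINT_INCLUDE "./trace_funcs.h"

#if !defined(TRACE_FUNCS_H) || defined(TRACEPOINT_HEADER_MULTI_READ)
#define TRACE_FUNCS_H

#include <lttng/tracepoint.h>

TRACEPOINT_EVENT(osd_func, enter,
    TP_ARGS(const char *, name),
    TP_FIELDS(ctf_string(name, name)))
TRACEPOINT_EVENT(osd_func, exit,
    TP_ARGS(const char *, name),
    TP_FIELDS(ctf_string(name, name)))

#endif /* TRACE_FUNCS_H */

#include <lttng/tracepoint-event.h>

// Usage: an RAII guard fires enter in the constructor and exit in the
// destructor, so every return path is covered; gating it behind a
// build-time define (as discussed below) compiles it out entirely.
#ifdef WITH_LTTNG
struct FuncTrace {
  const char *name;
  explicit FuncTrace(const char *n) : name(n) {
    tracepoint(osd_func, enter, name);
  }
  ~FuncTrace() { tracepoint(osd_func, exit, name); }
};
#define FUNCTRACE() FuncTrace _func_trace(__func__)
#else
#define FUNCTRACE()
#endif

void do_work() {
  FUNCTRACE();  // the single line added per instrumented function
  // ... function body ...
}
```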
C: That one seems like a no-brainer. The other one is a little bit trickier, because it has to annotate it with the object ID. So my thinking is that we should merge the pull request that just defines those macros, and then what I'm less sure about is whether we should merge all the actual functions that were annotated. I'm thinking probably yes, because he added a build-time define that lets you compile them all in or out, so they'll just go away; they won't compile in anything by default.
C: The only tricky piece is that for the object annotation there's a bunch of extra fields that have to get added to various structures, so that we have the object name to put in the trace point. So there's a little bit of overhead; there's some ifdef stuff that gets sprinkled through the code in order to make that work. But it seems like just getting these into master is going to make it easiest, and we can adjust that over time as far as the code paths that we're interested in.
H: So, you know, on a different part of the stack: at larger scale we've seen what look like some network bottlenecks, so as a model for the messenger and the network layer I started looking at the latest, you know, the DPDK changes.
C: Yeah, so yeah; in any case, yes, it would be great to see that comparison as a next step. Oh, I guess the one other thing I'll mention is
C: I'm not sure how much time my team can put on that, but I think that's probably a good path forward, because it means that all of the protocol logic is going to be the same across RDMA and TCP and DPDK and whatever, so the transport part is sort of separated out. That will be particularly important when we do messenger v2, hopefully during this next cycle, which will add encryption and all the other stuff. So.
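A minimal sketch of the kind of separation being described, with the protocol logic written once against an abstract transport (the interface and names are invented for illustration; this is not the actual AsyncMessenger API):

```cpp
#include <sys/types.h>
#include <cstddef>
#include <memory>

// One implementation per backend: kernel TCP, RDMA verbs, DPDK, ...
struct Transport {
  virtual ~Transport() = default;
  virtual ssize_t send(const void *buf, size_t len) = 0;
  virtual ssize_t recv(void *buf, size_t len) = 0;
};

// Framing, auth, and (with msgr v2) encryption are written once against
// Transport, so the protocol logic is identical regardless of which
// backend carries the bytes.
class Connection {
  std::unique_ptr<Transport> transport;
public:
  explicit Connection(std::unique_ptr<Transport> t)
      : transport(std::move(t)) {}
  void send_message(const void *frame, size_t len) {
    // encode header + payload, then hand the frames to the transport
    transport->send(frame, len);
  }
};
```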
B: If there isn't anything else and people are open to it, I was wondering, since we have everyone here, whether it would be at all interesting to discuss Igor's proposal on the mailing list. There's no right or...
C: Yeah, yeah, I'm sympathetic, but I'm not sure I'm ready to give up yet. I don't know; the nice thing about using an existing key-value database is that we sort of avoid maintaining this huge pile of stuff ourselves. It's easy to imagine that we could do way better for, like, the sequential time-series stuff, for PG log entries. I think if you use something off the shelf, you're going to find that there's, you know, a gap from what could be done with something custom, and you're going to pay that price; that's what you're looking at. Something that's custom designed ultimately will do better, but it's a lot of work. Yep.
K: Well, actually, my idea was to use the database traffic for metadata primarily, not data. It seems pretty strange to me that for 4K writes we actually have database traffic similar to, or even higher than, the traffic to the block device; so I suppose that the database is not intended for such traffic, and we'll never get the same performance as the block device. My idea was to use the block device as much as possible.
C: That sort of partial change might be something we should consider. I think that's a good point, that the workload is randomly distributed across the object namespace; and so when we're doing these sorts of big updates in the object namespace we're giving the KV backend sort of a worst case, or close to it, for RocksDB. That's correct, yeah.
I: Maybe your priority is... right. You've got two end points of the spectrum in terms of behaviors. You've got, you know, your PG logs and PG infos, which are essentially sequential, and then you've got something that's basically random. You know, I think you can handle both, but you're not going to find one algorithm for both; what you have to do is create essentially multiple algorithms that have the same transactional layer on top of them, and, you know, that's not out there now.
I: If you go look at what the ZetaScale guys have done: they started from sort of the opposite end of the spectrum from RocksDB, with a B-tree, which handles the random stuff pretty well but is terrible on all the sequential stuff; and then what they did was add some specialized behavior into it to deal with that, and have improved it; if you count basically the write amp now, you have much better characteristics.
I: The unfortunate issue is, if you look at this from a write-amplification perspective: with RocksDB, what you have is sort of one; forget the data, if you just look at the metadata, you've got one write plus whatever your compaction overhead is. Whereas with ZetaScale, because of the history of the code, the best you can do is actually two writes. Okay, but you don't get the compaction overhead; I mean, you have tree-splitting overhead, stuff like that.
I: Okay, so you end up with, like, you know, two and a quarter or two and a half as the metadata write amp there. So for really small volumes RocksDB is going to win, but as the volume size gets larger and larger it crosses over, and we're seeing that crossover, you know, in the sort of low double-digit gigabytes. Right, wait a minute; hang on.
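To make the crossover concrete (the 2¼–2½ write amp and the low-double-digit-gigabyte crossover are the figures from the discussion above; modeling RocksDB's compaction overhead as growing with the number of LSM levels, and hence with volume size, is a simplifying assumption):

$$\mathrm{WA}_{\mathrm{zeta}} \approx 2.25\text{–}2.5 \ \text{(roughly constant)}, \qquad \mathrm{WA}_{\mathrm{rocks}}(V) \approx 1 + c \cdot L(V),$$

where $L(V)$ is the number of levels the metadata occupies and $c$ is the per-level rewrite cost. For small volumes $L(V)$ is small and RocksDB wins; once $1 + c\,L(V)$ exceeds $\mathrm{WA}_{\mathrm{zeta}}$, ZetaScale wins, which per the measurements above happens somewhere in the low tens of gigabytes.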
D: I think complaining about the KV database's behavior isn't really the point; even if you did something custom, you're still going to have to do these updates. Well, I think it's a pretty good argument against RocksDB, but I don't think it's an argument against doing something custom.
I: I think once you... what you say is: you have metadata with a certain set of characteristics, and then you ask yourself what sort of engine is optimal for it, and a lot of it looks a lot like KV. I mean, you know, the screed that I published: I have issues with the KV interface itself from a performance perspective, CPU performance and complexity wise, but that's sort of orthogonal to the other issue that we're dealing with.
I: Okay, yeah, I'm not against it; I'm just saying we're paying a price for it. Okay? You know, and that price is hidden when you push the other stuff out of the way: you pay a lot of CPU for the simplicity of the KV interface. You know, you gained some time; there's no doubt about that.
C: Yeah, I guess. What I'm wondering is if we should consider some other backends here instead of RocksDB. The ones that come to mind are WiredTiger, which I know has a bunch of different storage drivers; if I remember correctly they have a B-tree one, and they have, like, an LSM one, and they have a logging one. I think I remember there being multiple. So that might be one possibility.
C: That would be one question to answer. There's another one too: that's TokuDB, which I think was gobbled up by Percona, so now it's called PerconaFT; it's their fractal-tree thing. One of the nice things there is that it can do big range deletes, which RocksDB doesn't currently do, although it certainly could if they just implemented it, but I'm not sure if they have. I think the behavior is probably going to be pretty similar to an LSM, in fact.
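For reference, RocksDB did later grow exactly this capability as DB::DeleteRange, which writes a single range tombstone instead of a per-key delete for every key in the range. A minimal sketch of the call (the API is real; the key names are illustrative):

```cpp
#include <rocksdb/db.h>

// Drop every key in [begin, end) with one range tombstone rather than
// iterating and issuing per-key Delete()s.
rocksdb::Status drop_range(rocksdb::DB *db) {
  rocksdb::WriteOptions wo;
  return db->DeleteRange(wo, db->DefaultColumnFamily(),
                         rocksdb::Slice("pglog.0000000000"),
                         rocksdb::Slice("pglog.9999999999"));
}
```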
I: With an LSM, if you're an LSM you've already sort of fundamentally addressed that issue; it's been made pretty efficient. What you're doing is optimizing something that's already pretty efficient and ignoring the thing that's killing you. Yeah, yeah; so, yeah, another LSM is just, you know, not going to get you there, period.
I: Well, I think we've got ZetaScale where it is, okay. Yeah, I think that, you know, from my analysis when I looked at ZetaScale, like I said, we're getting a write amp of, you know, two and a half, two and a quarter, for batched transactions; and, you know, if I had a clean sheet of paper I could get that down to, you know, probably one and a half or something like that; okay, or to two; or reduce the batch size.
I: You know, and keep the write amp the same. But none of that addresses the interface issues, which are, you know, the inefficiency of scanning things that are nearly adjacent, and the lack of reference semantics forcing you to fracture large objects; which are, frankly, significant performance costs. The other issue, that frankly we haven't even grappled with yet, that you're going to find you're paying a lot of energy for later on, is the fact that all of these databases are thread-synchronous in their read paths.
I: So, you know, ultimately you're paying all of this extra context switching for that, which, frankly, when it's all said and done, is going to lard up the path lengths for a transaction. I mean, if you really want to get, you know, CPU efficient with this, you're going to have to abandon all that code.
I: Anyway, because there isn't anybody out there that I've seen that has, you know, a run-to-completion, interrupt-style KV implementation; they're all thread based, and fundamentally, on flash, you basically throw a decent chunk of your performance out the window just by doing that.
I: I suspect that if you take a KV interface, something like ZetaScale or RocksDB, that's thread based, you could probably do better with user-space thread switching: you create your sort of little miniature simulated threading environment. That's not as good as a fully rewritten async run-to-completion interface, but you're going to have to simulate a threaded environment for these guys anyway, and you'd do it without the overhead of the kernel context switch, which is pretty heavy. You know, I think you could actually do that.
I: You won't be quite as efficient, but the delta might be reasonably small. So you're going to end up having to do that, and of course that's going to mean recreating a certain amount of lock and synchronization code and IO code. I don't think that's a huge deal; you know, it's a week or two's work, maybe three or four by the time it's all said and done. Yeah.
D: If we wanted to seriously think about this, I think the right answer is not so much to talk about the internals; we could talk about that for months; but to write down the interface you'd like to see the database have. One way or another we still want an abstraction there, right? Well, agreed, yeah.
I: Well, yeah, I think this is an interesting discussion for another day. The reality is we are where we are with BlueStore; you know, it sits in a code base. And, going back to Sam's point about where do you spend your energy: rather than throwing down, you know, several man-years right now tearing up the OSD to rebuild it from scratch, you know, I'd go spend a little bit wrapping up what we have.
I: I don't disagree with you; I'm just saying, bang for the buck, you know, I think that's going to yield a lot more. I just sent an email to Sam; I'm concerned about where we got to. From my perspective, I'm not sure we haven't done the equivalent of essentially crippling that feature to the point where it's actually no more useful than it is in the Jewel release.
I: To whom it is useful; I'm sure that's true, and that decision potentially has ramifications for me that I have to go figure out. But, you know, I would put that higher on my list of things than a rewrite of the OSD.
D: I get it, and there's just no way to do this in an ad hoc fashion without making the OSD entirely incomprehensible.
I: Well, okay, so I'm not sure that I have the information in hand to be able to offer that right now. Okay, I mean, I think answering that question requires a certain amount of synchronization among the participants on a list of things, like estimates of available manpower and estimates of the costs of potential alternatives. I think if you are in agreement on what those are, then the preferred path becomes more about your disagreement on the numbers than it is a statement of the value of them.
B: Basically, the messenger work, and looking at some of these trace-point analyses, is really good; I think that's the kind of thing we need to be doing. I think looking at BlueStore, and any way possible to reduce the amount of metadata, is really good; and/or looking at ways to reduce write amp in whatever key-value store we end up choosing is really good. Well...
I: Let's be a little careful, okay? I think that's actually a downstream effect of what you're really trying to optimize for, under a set of assumptions that I would question. Okay, so, I mean, yes, I think less metadata is always better; it's always better for everyone, okay. But the value of it is different in different environments, the costs of it are different, and you may find out that they're an epsilon.
I: So, you know, I think the bigger issue to deal with is to ask the question: given the structure of BlueStore as we have it today, what's the best that we can get it to be? You know, I'll be honest: changing backends at this point, you know, is something that will take a long time.
I: Let's put it this way: I mean, you're just about to put out the Kraken release, right? Well, okay, so people are expecting it in their Christmas basket. Now you can undo that, okay, you know, and there'll be a certain amount of, depending on how much you want to do that: if you undo that by a month, fine; if you undo that by a year, you're going to get some egg on your face. Okay.
I: That's the deal, and I think, before you and Mark, I'll take your question as me stating our preference for which of the various options to go take; and what I'm saying is, I think until we agree on what the options are, there's no way to answer that question. Yeah, yeah. Because changing the KV backend, you know, is not... First of all, you don't want to get yourself back into the boat that you're in now with RocksDB, which, you know, happened because there was no...
I: ...there wasn't enough sort of research on connecting the abstract model to the physical implementation, to see whether or not that implementation was going to be good enough. So basically, you know, you took a risk, and, you know, rolled the dice, and came up with, you know, less than you wanted. I don't think you want to repeat that exercise. So before you'd even really be ready to switch to some other theoretical KV backend, I think you've got a fair amount of work.
B: There's a little bit... with RocksDB, I think you have to be a little careful there too, in that we didn't really know what we would see with RocksDB until we tried it, right? I mean, it would have been very difficult to guess from the very beginning that RocksDB would necessarily do badly, and it's done much better than, say, LevelDB would have done; I will very strongly maintain that. Is it possible that we can improve on what actually...
I: I think you have to be careful how you state that. I think the only thing you could really state is that, for the stalls during compaction, RocksDB does a better job than LevelDB does. Okay, if you're just, you know... let's, for the moment, assume that you could take LevelDB and tune it in a way, or decide that the stalls were irrelevant, for whatever reason; I'm not sure you'd see it perform, in fact, a lot better.
I: Yep. You know, so, you know, I think the problem is you've got the classic tactical/strategic divide here, and, you know, your tactical choices are constrained. I don't think it takes long to make a tactical decision, because the choices are constrained: just enumerate the choices and figure out if there is some work to do to augment the knowledge base, which is probably small. You know, the strategic vector is a whole different issue, and let's not conflate the two right now.
I: You know, I think that, you know... clearly everybody is building toward the conclusion that says you're going to need to rebuild the OSD, and when you do that, you know, you're probably going to have to revisit this issue of...
I: ...what's the storage engine abstraction: not just the interface to it, which, you know, I think has issues, but also, you know, the semantic behavior of it, etc. And you may choose expediency there and take something that's less optimal to get to market sooner, or you may choose to say: no, you know, this is an area where my fundamental performance is, and I need something that's, you know, pretty optimal, and unfortunately there isn't a thing out there that does what I want.
I: You know, I think, yeah; I mean, you mentioned a couple of other key-value stores. You know, it's certainly worth a day or two reading through the documentation on them to see if they seem to offer this kind of behavior-specific... you know, we're looking for essentially a palette of behavior-specific stuff. We'd like to be able to say: these keys, you know, should have this kind of engine, and these keys should have that kind.
I: Between us we've got about three different sets of all-flash hardware in our repertoire, and the optimal choice here is not what you would expect on some of them, oddly enough. The other thing here is, we haven't even... yeah, I don't know that anybody... I don't know how much work has been done on the hard drive side, which matters, yeah.
B: On the OSD side, at least, in the things that I've done, the metadata load seems to be low enough that we're more dominated by the actual IO than by what's going on inside, you know, RocksDB, or potentially ZetaScale; I don't know there. But at least with RocksDB, it seems like we're not seeing a lot of the data movement between different levels, and the overhead there, that we see with easy fast writes on NVMe.
I: I mean, the costs for compaction are radically different on a hard drive than they are on flash. You know, that's going to change your view of the criticality of it, and what's good performance or not, etc. I've got to agree with that. I guess I'm probably more interested in the flash than the hard-drive model.
I: Yes, yeah, I think that's really the sweet spot. The all-hard-drive model: I'm not terribly worried about optimizing for it, and I'll reserve the right to change that opinion when some sales deal drops in my lap; but until then, okay, you know, I'm not terribly worried about it. I think, you know, it's going to underperform your existing backend, but maybe...
C: Well, the concern is... well, yeah, my expectation is that when we have lots of small IO, and we fragment heavily, and we have onode explosion, yeah, then the worst case could be slower. But yeah, I agree: I'm mostly concerned with making large-object storage workloads work well on hard disks, because I think that's where they're going to be used, and I'm less worried about optimizing for, you know, 4K random IOs or whatever running on spinning disks.
I: I agree with that, okay. And, you know, as far as the OSD, to me it's the hybrid environment at this point, you know: a small amount of flash over rotating is pretty cost-effective, and likely more so going forward. But, you know, I think the reality, I'll be honest with you, is I think you got what you got, okay? You know, if somebody could pull another KV store with the right characteristics out of their hat and drop it in, yeah, okay, you know, that would fit with you.
I: If, for example, Berkeley DB or one of these off-the-shelf KV stores had exactly what you needed, okay, then yeah, you could probably drop it in in a few weeks, you know, and life would get better. I don't know how much better; you know, I'm not sure it's going to do a lot better than what we're seeing with ZetaScale. Okay, and well...
I: Okay, but, you know, my point is: I'm not saying that that's not worthy of consideration, but that's not a tactical choice; that's a strategic choice, and that's not going to happen for Luminous. Okay, that's not going to happen for you. So, you know, you either have to stay the course with what you've got in front of you, which is BlueStore, and that's going to show up in Luminous as BlueStore, with the characteristics that it has.
I: Okay. Or, you know... because, let's remember, there's still a lot of work left in BlueStore; it ain't done by any stretch of the imagination. You haven't spent a lot of energy on the dynamics yet, guys, okay; we haven't really explored a lot of the performance space either; not to mention all the other little ugly issues about getting storage code into production. So, you know, if we're lucky, we're seventy percent of the way there.
I: Eighty percent of the way there; you know, if we're unlucky, we're fifty percent of the way there. But setting that aside for the moment.
I: You know, with the RDMA and the memory footprint and all the other issues that we know about, you know, okay, so it picks up another twenty or thirty percent; big deal. You know, it's not going to pick up an order of magnitude, because even if the OSD's... even if the object store interface is infinitely fast, you don't actually go very fast. No.
C: Yeah, that's what I'm saying, though: it's not going to, within the current framework of the OSD. So I don't think there's any point in doing a rewrite of, or rearchitecting, BlueStore, which is why we should continue with the current path and continue looking at how we can improve the key-value backend. But after doing that rewrite, the futures-based rewrite of the OSD guts, I think then we'll know more about what interface would be more optimal.
I: I mean, you know, I think when you redo the OSD you're going to redo the communications interface, in order to, you know, do something that is deployable at scale with RDMA, okay. You know, the way that memory gets managed in an RDMA environment will permeate its way through the OSD; I mean, your good friend the bufferlist might not look like it does today when you're done.
I: You know, that in and of itself is liable to be the thing that forces you to jettison the existing code base; and, you know, my guess is the details of the object store equivalent interface in that world are a small part of the problem, not a big part of it. So, you know, yeah, and I think that's a great thing, but I think, you know, you've got to realize that that's years of work to pull that together.
I: That really needs to be backed by corporate entities that, you know, believe in it; exactly, okay. So, to be honest with you, this is the concern that I was expressing a little with Sam, which is, you know, we're on a trajectory to have, you know, erasure-coded overwrites, you know, and the backing off of that now, because the world has to be rebuilt to do that...
I: ...you know, is a major issue for me, you know. And maybe I hadn't heard it before and I should have heard it, okay. But if that's a real problem for us, and we were focusing on using...
I: ...you know, basically what we're saying now is that there's no erasure coding on block storage, you know, in the pipeline.