From YouTube: Ceph Performance Meeting 2020-09-02
A
All right, I want to get started on PRs here, and then we'll move on to discussing the BlueStore onode. So I saw two new PRs this week. One was from Josh, to speed up caps and cap updates in the monitor. Josh, I saw that Joao had reviewed it; it doesn't look like we changed default behavior, right?
B
That's right, yeah; the wording is confusing, but the default behavior doesn't change. It's only a niche use case, so we just have an option to avoid parsing in that case.
A
So I'm not super familiar with this PR, but are we primarily limited by all the object creation and deletion using that Boost framework, or what? What exactly are we seeing now?
B
Well, right now this option just skips the parsing entirely, when it's enabled, for the OSD capabilities. I'm not sure if it's the object creation necessarily that we're limited by, but that's my intuition; a bunch of different calls within Boost Spirit showed up in perf. But without the parsing we seem to be limited more by RocksDB, perhaps, or some kind of locking in the monitor. I looked at the monitor with your script to analyze the logs, Mark.
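For illustration, one general way to dodge repeated Boost.Spirit work is to memoize parses keyed by the caps string; this is a hypothetical sketch, not what the PR does (the PR simply adds an option to skip parsing outright):

```cpp
// Hypothetical sketch, not Ceph code: cache parsed caps by their source
// string so identical cap updates skip the expensive grammar entirely.
#include <string>
#include <unordered_map>

struct ParsedCaps { /* grammar output would live here */ };

// Stand-in for the expensive Boost.Spirit parse.
ParsedCaps parse_caps_slow(const std::string& /*text*/) { return ParsedCaps{}; }

// Single-threaded sketch; a real monitor would need locking around `cache`.
const ParsedCaps& parse_caps_cached(const std::string& text) {
  static std::unordered_map<std::string, ParsedCaps> cache;
  auto [it, inserted] = cache.try_emplace(text);
  if (inserted) it->second = parse_caps_slow(text);
  return it->second;
}
```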
B
I wonder if there's some kind of tuning we're running into there that we could change; I think it's, like, the write-blocking behavior or something.
A
How often were we seeing... sorry, I don't remember: on the monitors, when we ran that script for looking at the RocksDB compaction statistics, were we hitting compaction very often, or was it kind of intermediate, sorry, intermittent?
B
A lot of... well, go ahead. Yeah, I guess we're pushing a surprising amount. I mean, the database itself is pretty small, on the order of 30 megabytes or something, but I guess those levels are initially small enough that we're still running into a lot of compaction.
A
It could be interesting just to try an experiment where you create much larger buffers in RocksDB and the write-ahead log, the same way that we do in BlueStore, and see if that reduces the amount of write amplification coming in.
A
That's why we have such huge ones in BlueStore; it's kind of ridiculous how big they are, but it seems to be how we get around having, you know, lots and lots of keys end up being put into L0 that are probably then deleted really soon afterward.
A
Maybe we're doing something similar with the mon.
A
Okay, interesting. Okay, let's see what's next: change the default value of osd_async_recovery_min_cost. I don't even remember that option. So yeah, Josh, do you know how that affects things? I don't remember it.
C
Yeah, that's basically the threshold at which async recovery kicks in, and the PR proposes that we reduce the threshold so that we can do more async recovery. This is based off of some internal testing that was done by some of the folks at Red Hat, and they saw that with the default value of 100...
C
They saw a lot of async recovery happening, especially for RGW workloads; a lot of async recovery happening on replicated pools, but not as much, almost zero, on EC pools. Now, it's totally possible that the workload they're using, or, you know, the setup they have, is not able to induce async recovery...
C
The cost is not able to cross the threshold of 100. But their proposal is that, since there is no harm in having more async recovery, we should just reduce the threshold. There's been some discussion on the PR; I think Xie earlier proposed, I mean, he said, that they're using a min cost of one in their cluster, and that's been the case because they want more and more async recovery happening.
C
But I do agree with his latest comment that there is a caveat: during upgrades, or not just upgrades, when you have mixed clusters running different versions with different min costs, it is possible that choose_acting does go into a loop, where we might just not want to go there.
C
We've seen issues like that in the past, so we should only be changing the cost when we have only one version running in the cluster. So I have to think about that a little more, and then we can see how we can implement it; but in general it's a good idea to reduce the threshold.
A
Right, let's see; those are both of the two new ones that I saw. We had one closed this week from Majianpeng that was enabling RocksDB pipelined writes. I still think that's actually not necessarily a bad idea in general, but it turns out that when Majianpeng did more testing, it looked like it was not really having much effect; it wasn't actually affecting a whole lot of I/Os, so doing that was not as impactful as he thought it would be, I guess.
A
Anyway, he closed it; maybe someday we'll revisit it, but for now it's, you know, fine. We have plenty of other things to worry about. So let's see, updated PRs: Majianpeng also has another PR for reducing bufferlist rebuilds. Radek reviewed that first, and I think he was maybe even looking at a more general solution for it, but he approved the PR, and Igor also approved it.
A
It did go through a round of testing, and it looks like there were some failures, so it might need some fixes; it's back in Majianpeng's hands now to see what's wrong. Also, optimizing the lock in the BlueStore writing process: oh, this is the one that no one wants to touch, but Kefu actually said in the PR that he was going to.
A
So my hat is off to him for taking the time to look at it; hopefully Kefu will be able to make a determination of whether or not we really want to take this on. Beyond that, there's lots of stuff in the movement category. The ones that keep coming back, that we need to look at, are for reducing onode memory usage: both Igor's PR, potentially, and then also the thing I wrote a while back, almost a year ago now, to separate out the block cache into multiple block caches per column family, just to allow us to handle the block cache for onodes and the block cache for OMAP separately, and also extents and all the other BlueStore metadata.
A
So those still need to be worked on, maybe after I'm done with this ML work, if I can get back to it. That's it as far as I saw. Anything I missed, guys?
A
All right. Well then, Adam or Gabi, would either of you like to talk about the things that you've been looking at in terms of onode refactoring? Adam, I know you had a document that you had sent out earlier; maybe we talk about that first, and then, Gabi, you can talk about your stuff.
D
I just constructed a proposal for how we can do that, and it's just what it is. I mean, I'm not even proposing that we definitely should do it, but I'd like to put it out there, at least as a point, to show that this point in the space of possible solutions exists. There is a cost: the metadata will be much larger. There is a benefit: right after you get access to your metadata, either by getting it from RocksDB, or reading it from disk, or whatever, you can access it as any other data structure.
A
So, you know, at different times in the past we've talked about things like completely ditching encoding and just writing the bufferlist directly into RocksDB, or, potentially, trimming things like varint encoding and anything that takes a lot of time, I guess. Or, you know, maybe a more exotic thing might be storing an encoded form of the data in memory, and then only doing decoding when we actually need to access something.
A
I guess the thing that I have seen keep coming back, over and over again, is that we're being asked to reduce CPU usage.
A
People want to be able to run OSDs with one or two cores and have the rest of the system available for hyper-converged processes, other things, you know, user processes, whatever. Right now, with NVMe drives, a BlueStore-backed OSD can easily take between 10 and 14 cores, from what I've seen, when faced with a very, very heavy random write workload.
E
Can I jump in? (Yes, absolutely.) Yeah, so I came from a different kind of system which was heavily focused on performance. Unlike Ceph, where the algorithm allows you to grow, that system could not grow beyond 16 nodes; 16 was the absolute maximum you could use. So we were always forced to squeeze as much performance as possible from each node, because you could never say, you know what, take another 100 nodes; each one of them was extremely expensive.
E
Every time you add another node, it's hundreds of thousands of dollars; not very common. One thing I noticed again and again when you look at the code, and it's very different from the way I'm used to, is that everything here is strings. I'm used to seeing stuff in binary format.
E
So I think an easy target would be to move away from strings and concentrate on binary format wherever possible. One example I give is the onode attribute format: in every onode we have a map of attributes, which is strings. The way I'm used to seeing things like this is: if the attributes are coming from a predefined set of attributes, then you enumerate them, and instead of writing the attribute name you just have an index saying it's that attribute. The reader code shouldn't care about the name; we should just use a binary compare, a binary operation. The strings are only for the users, and if you need them for debug purposes, then there's always a debug function which can access that, right?
E
That's the easy case. If your strings are coming from a much bigger pool and you don't know the definitions, then you can create a translation table on the fly, and I suggested doing this one per thread: every time you see the client coming in and passing in a string...
E
You check if you have it in your hash table. If you do, then you immediately translate it to a binary index, and that's what you see inside the OSD; the OSD should never see the string afterwards. If you don't have it, then you create a new entry, get back an index, and you start using that. So that way you get a few things: I mean, you shrink the size of your data, the compares and searches are cheaper, and you don't do dynamic allocation.
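A minimal sketch of that interning scheme, with assumed names (nothing here is existing Ceph code); Gabi's version would be per-thread, which is why there is no locking:

```cpp
// Hypothetical sketch, not Ceph code: translate attribute-name strings to
// 2-byte indices once, at the message boundary; everything past this point
// compares indices instead of strings. One instance per thread, no locks.
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

class AttrInterner {
  std::unordered_map<std::string, uint16_t> to_idx_;
  std::vector<std::string> names_;   // kept only for debug printing
 public:
  uint16_t intern(const std::string& name) {
    auto it = to_idx_.find(name);
    if (it != to_idx_.end()) return it->second;   // seen before: reuse index
    uint16_t idx = static_cast<uint16_t>(names_.size());
    names_.push_back(name);                       // new entry on first sight
    to_idx_.emplace(name, idx);
    return idx;
  }
  // "Strings are only for the users": debug-side reverse lookup.
  const std::string& debug_name(uint16_t idx) const { return names_[idx]; }
};
```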
E
Okay, so for example, in the onode, all the attributes are using a hash table. I'm not very familiar with the way C++ is doing hash tables; I know many years ago people suggested that when a hash table has very few entries, it should just be using an array, because hash tables are efficient when you have hundreds of attributes, but if you've got just a few of them, then you should use an array. I don't know if this is what C++ is doing, but if it's not the case, I would suggest just creating an array.
E
I mean, how many attributes are you going to have? You're going to have 10, 20 of them? It's probably cheaper just to scan the array looking for yours than running the hash table, and you save the hash function in performance. The hash function also tends to have this ugly property that the memory location is unpredictable.
E
Because of the way we like to randomize the placement, you tend to miss the CPU cache and go out to DRAM, which is very expensive. And say the index is a two-byte index and you have 32 entries in your map: you could easily create an eight-byte value with four copies of the index and then run the compare in eight bytes. So you could scan 32 entries very quickly; you're just doing eight compares, and then inside you do the internal compares.
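A sketch of that packed scan (illustrative only; the function name and the 2-byte interned indices are assumptions carried over from the sketch above, not Ceph code):

```cpp
// Hypothetical sketch: scan a small attribute table of 2-byte indices by
// comparing eight bytes (four indices) at a time, as described above.
#include <cstdint>
#include <cstring>

// Find position of `needle` in `idx[0..n)`; n must be padded to a multiple
// of 4 with 0xFFFF sentinels. Returns -1 if absent.
int find_attr(const uint16_t* idx, int n, uint16_t needle) {
  uint64_t probe;
  uint16_t four[4] = {needle, needle, needle, needle};
  std::memcpy(&probe, four, sizeof probe);
  for (int i = 0; i < n; i += 4) {
    uint64_t block;
    std::memcpy(&block, idx + i, sizeof block);
    // XOR makes a matching 16-bit lane zero; the classic "has-zero-lane"
    // bit trick then flags the block cheaply (it may over-trigger across
    // lane borrows, hence the scalar verify loop below).
    uint64_t x = block ^ probe;
    if (((x - 0x0001000100010001ULL) & ~x & 0x8000800080008000ULL) != 0) {
      for (int j = 0; j < 4; ++j)
        if (idx[i + j] == needle) return i + j;
    }
  }
  return -1;
}
```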
E
So, I don't know, maybe 12 compares at most, and that's a big one. So try to use more of that. Maps are easy; they're the most natural thing to use and we love them, but performance-wise they tend to take more memory, because you need some space: you keep the key, you keep the data, you do dynamic allocation in small sizes, so there's always some overhead for them.
E
So those are the first things coming to my mind. Then the other one, for the onode: I was suggesting that we identify the portion of the onode that is fixed. We know all onodes have a fixed form, a fixed part, and then there's a dynamic part which can grow and shrink, so we could allocate onodes from a pool of that fixed size.
E
So all of them are just going to have the same amount of space, and then the last entity is a pointer to a dynamically allocated area, which you could allocate using a buddy system. Then you allocate the part which is changing, but the part which is not changing you don't have to copy again and again, because that you can do in place. So if you change one byte, you don't have to copy the full entity. And another thing to do here:
E
If all your data structures are simple in-memory data structures, you don't need encode and decode; you just map the thing into binary format and dump it as-is. For the last thing, the pointer, you need to do something more interesting, but the first part, which is constant, you just write as-is; you don't need any formatting.
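A sketch of that fixed-plus-dynamic split with a memcpy-style "encode" (hypothetical field names; this is not the real bluestore_onode_t, and raw dumps like this are not endian- or ABI-portable, which is part of the trade-off being discussed):

```cpp
// Hypothetical layout, not Ceph code: a trivially copyable fixed part that
// can be dumped to disk as-is, plus one pointer to the grow/shrink part.
#include <cstdint>
#include <cstring>
#include <type_traits>
#include <vector>

struct OnodeFixed {            // "encode" == memcpy for this part
  uint64_t size = 0;
  uint64_t flags = 0;
  uint16_t attr_idx[16] = {};  // interned attribute indices (see above)
  uint32_t var_len = 0;        // bytes of variable tail that follow on disk
};
static_assert(std::is_trivially_copyable_v<OnodeFixed>);

struct Onode {
  OnodeFixed fixed;            // every onode: same size, pool-allocatable
  std::vector<uint8_t> var;    // dynamic part (extents etc.), buddy-allocated
};

void encode(const Onode& o, std::vector<uint8_t>& out) {
  out.resize(sizeof(OnodeFixed) + o.var.size());
  std::memcpy(out.data(), &o.fixed, sizeof(OnodeFixed));
  if (!o.var.empty())          // only the tail needs real serialization work
    std::memcpy(out.data() + sizeof(OnodeFixed), o.var.data(), o.var.size());
}
```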
A
So, Gabi, this is a little old at this point, but it's the one that I was able to quickly find: I pasted a wall-clock profile of the OSD under a 4K random write workload.
A
This is kind of a standard case where we really stress certain parts of the OSD, specifically the kv_sync thread, and it's not necessarily definitive about, you know, where bottlenecks creep up, but so far our profiler seems to have done a fairly good job of pointing out areas where we can potentially make improvements.
A
So, I don't know... this is basically time spent in different parts of the code under a 4K random write workload, you know, of a specific OSD. So, like, if you go to line 415...
E
A thousand samples?

A
Yes. So you can see in here that, for instance, we're spending a fair amount of wall-clock time in the inline skip list in RocksDB, doing key comparisons.
A
So in this case, this is because we have such large buffers in RocksDB, where they're filling up with so many keys, that it takes a lot of time for RocksDB to actually walk through and do comparisons, figuring out where keys should live for ordering purposes.
E
Mark, sorry, a question: is this code trying to sort all the key data, so it can be staged sorted? Is this trying to create a sorted table from what we have in memory, before we write it to the disk?
A
So, yes; see how, earlier on, this is during, you know, this column family insertion, and we see a memtable add called on line 430...
A
So that's basically where RocksDB is iterating through all of these, doing key comparisons to do that add into the memtable, because everything is sorted.
E
I don't know who builds the classic RDBMSs now, but OLTP systems are used to writing small entities, and they don't need to sort them. I mean, we don't care about sorting; we don't need all the metadata around all these things. The thing about key-value and LSM entries...
E
They are great if you have big objects and those objects are not changing very often. But onodes are extremely small and they keep changing, so you keep creating more and more versions, and when you read them you have to visit that many levels until you reconstruct them; and it's a tiny object.
E
They're all very small; they take some changes, and then you stage them. Did we try comparing RocksDB to MySQL, just for the onode? Not for the object itself, but for the onode state.
A
So Sage isn't here to really give what he was thinking during this, but I can give you my perspective from when this was all being written. I think that Sage, at the time BlueStore was being written, really wanted to be able to get all of this data in one transaction going into the write-ahead log in RocksDB, especially for the possibility of being able to do deferred writes.
A
So, you know, back then there were SSDs around, but hard drives were still really popular, right? So we were trying to figure out a way to support both use cases, and for hard drives, having the advantage of being able to do a single transaction, with both the onode metadata and, potentially, if it's a small I/O, also a small amount of data, written into the write-ahead log, was really...
E
Attractive. But I think this is what we're trying to do with SeaStore, right? I mean, that makes sense to me: you just stage the metadata with the data. That's perfect, it made sense. But separating the entities on every entry, I'm not sure it's the best fit.
A
I'm not sure that RocksDB, in the long run, really is a great fit either for a lot of what we're trying to do. But having said that, you know, with Crimson we're changing that direction a lot, right? Yes, I think the question right now with BlueStore is: given what we have, without completely changing the RocksDB backend... because we've actually looked at a couple of things; we did try replacing it with LMDB.
A
I actually wrote an updated version of an LMDB backend that someone else had made a while back, and it was slower; we couldn't make it as fast as RocksDB. It probably was not fully optimal; there are probably things we could have done to improve it. But the early prototypes and early tests didn't really show a significant improvement with an easy implementation.
A
So maybe we could have done better if we had kept on working on it, but it certainly was not a dramatic improvement with just a basic implementation. MySQL or PostgreSQL or something else: you know, we could try something like that. I think, though, if we were going to go down that road, I'd advocate we just write our own thing; specifically, that we write our own write-ahead log, and whatever we have behind it is, you know, its own thing, but...
A
Well, this is something I haven't followed recently, so yes, probably; but I haven't talked to Sam in a while, so you may know more than I do at this point about what Sam is working on with it.
E
So that's SeaStore: it's writing a log-structured file system, so everything is flowing there; the metadata and the data are going there. So you could still do the sequential write, or move forward with zoned namespaces. But at least I hope that we're not going to do a separate log for the onode.
A
So there was a prototype from Intel a little while back, from Lisa, where she was trying to take specifically the PG log and just write it out to BlueFS directly, without going through RocksDB; so not exactly what you're doing, Gabi, but... Unfortunately, it didn't handle dynamic sizing; it just allocated a very large block and then, hopefully, the PG log entry fit inside it, and that was as far as that one went.
A
She didn't think that she saw a significant improvement with it, so it kind of just ended up being abandoned after that. But I'm not convinced that we really gave it a fair shake, or at least gave the idea of trying to change how we write PG log entries a fair shake. That's why I'm very excited about what you're working on, because I think there's still potential and possibility there to really improve...
A
...what's going on. Whether or not that translates into something even bigger, in terms of looking at how BlueStore stores data and how we do the write-ahead log and how we actually write transactions out, I don't know; maybe it's too much work at this point. But certainly this is, I think, a rich area for performance improvement, from the traces and wall-clock profiling that I've done.
A
But that's why I posted that profile: it might give you kind of an idea of at least where we're spending wall-clock...
E
Time. And actually, sorry, one thing, coming back to something I said in the beginning: a big part of the workload there is the sorting, right? I mean, before you create a run, before it's staged by RocksDB onto the LSM, you must sort everything. So if you shrink the keys, the sorting is going to be faster.
E
I mean, it's still the same algorithm, but at least the constant could shrink. If we're talking about RGW, for example, a key could be hundreds of bytes of string, and comparing hundreds of string bytes is expensive; but if you make it a binary format, if you compress it on the client side, you will never see the full key. Maybe a 512-byte key could shrink to a 64-bit binary compressed format; then comparing them could be much faster, and you could also use...
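To make the comparison point concrete, the standard trick is fixed-width big-endian encoding, so memcmp order matches numeric order and one bounded compare replaces a long string compare; how a long RGW name would be compressed down to such an id is the hard part and isn't shown (illustrative only, not Ceph code):

```cpp
// Illustrative only: fixed-width big-endian keys compare correctly with
// memcmp, which is essentially what an LSM comparator ends up calling.
#include <cstdint>
#include <cstring>

// Encode a 64-bit id so that memcmp() on the 8 bytes sorts numerically.
void encode_be64(uint64_t v, unsigned char out[8]) {
  for (int i = 7; i >= 0; --i) { out[i] = v & 0xff; v >>= 8; }
}

int compare_keys(const unsigned char a[8], const unsigned char b[8]) {
  return std::memcmp(a, b, 8);  // one 8-byte compare vs. hundreds of chars
}
```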
A
Yes. I also think what you're doing with reusing keys is going to be very important, because I believe that if we do that, so we no longer have tombstones, we can shrink the buffer size for the memtables, so that we're doing compactions more often with smaller amounts of data, and I think that's also going to help.
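A sketch of the key-reuse idea using stock RocksDB calls (the key format and the 3,000-slot window are made up for illustration; this is not the actual work being discussed): overwrite a fixed ring of key slots so old entries are superseded by newer Puts instead of Deletes, and no tombstones accumulate.

```cpp
// Hypothetical sketch: recycle PG-log key slots instead of put+delete.
// A real version must also store `seq` in the value so replay can tell
// live entries from stale ones.
#include <rocksdb/db.h>
#include <cstdint>
#include <cstdio>
#include <string>

void append_pglog(rocksdb::DB* db, uint64_t seq, const std::string& entry,
                  uint64_t window = 3000) {
  char key[32];
  int n = std::snprintf(key, sizeof key, "pglog.%016llx",
                        static_cast<unsigned long long>(seq % window));
  // Overwriting the slot supersedes the old value with a newer Put;
  // compaction never sees a Delete tombstone for these keys.
  db->Put(rocksdb::WriteOptions(), rocksdb::Slice(key, n), entry);
}
```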
A
So all this stuff is kind of tied together, right? Like, onodes and how big they are and whether or not they're encoded; the PG log and how many PG log entries we have coming in, and how often they're tombstoned and when that happens; and how we have transactions hitting the write-ahead log. All this stuff is interrelated in really complicated ways, but I think, you know, with all this work that you're doing, and the work other people are doing...
A
If we can start experimenting and finding out what things affect what in different ways, I think it will help us learn what's useful and what's not, and what we should be doing.
B
I think the idea, like Gabi was talking about, of potentially having onodes outside of RocksDB is pretty interesting, because it's not just the CPU overhead there; it's also the amplification, meaning the NVMe write amplification.
B
It can just matter a lot on NVMe and hard disk, and when we have these tremendous numbers of objects, like for RGW, where they have billions and billions of objects, that's tons and tons of onodes, which don't necessarily need to be all sorted together like that, when RGW is maintaining its own index, and for the OSD's purposes listing onodes doesn't necessarily need to be a low-latency operation.
B
It might be worthwhile to try to figure out if there's some kind of perf test we could do with onodes outside of RocksDB, and see what effect that has.
E
If we're using zonefs, my understanding is that as long as you keep writing forward, writes tend to be fairly quick, even on QLC drives. So just use a simple log and flush the data forward; always move forward, don't try to optimize access. I mean, the reason the LSM is doing the sorting is because they need to build the memtable and do the sorting immediately, because you need to merge levels.
E
But if you keep writing forward, and it's just a write-ahead log where, every time you've written something out safely, you can put a checkpoint and discard, and it can be cyclic, then you could just do something simple. But again, it's not going to work with spinning drives; as you said, with spinning drives you're going to kill them doing this. But zoned namespaces might be a friendly... a friend, yeah, if you have...
A
A dedicated zone that's just for this. I think the hope with RocksDB and the BlueStore write path was that the overhead for this wouldn't be so bad on flash, that, you know, we could still do a reasonable job with it; but you're absolutely right. You know, the BlueStore design is still kind of trying to have the best of all worlds, right? Like, let hard drives still be fast, but make NVMe faster than it was in the FileStore days; and it is, right?
E
Because there is nothing the same... all the complications come from retrieving these objects, but if we're using a write-ahead log, you're not going to retrieve them; in most cases you're never going to read them again. You just write them, and, you know, once you've destaged your data elsewhere, you're going to forget them. You don't need to build fancy indexes for them; you don't need all kinds of sorting, because in the failure-scenario case, okay, you'd go back and you would read them one by one, and nobody cares if it's going to take minutes.
E
Maybe that's okay! Nobody is going to say, now go fetch me what you have on disk; you don't need it, you have it in memory. So we just need a write-ahead log; we don't need an LSM. And there might even be write-ahead log implementations we could borrow.
B
In the past, like for FileStore, we wrote our own write-ahead log, essentially, for the same reasons. I don't think that part is necessarily that complicated.
B
Yeah, yeah. But I guess I do want to point out that, with BlueStore, I'm not sure we should be thinking so much about, like, zoned namespaces or super-fast devices.
B
That's kind of where SeaStore is going to be coming in, in, say, a year or two, once it comes out stable, and that's going to be fully optimized from the ground up for that use case. So I'm not sure it makes so much sense to try to shift BlueStore in that direction, so much as to improve its efficiency for, like, slower SSDs and hard disks.
E
Right, for SeaStore. And so, I mean, there's a tricky thing without zoned namespaces: if you're writing short writes one after another, you're going to create crazy write amplification. Zoned namespaces, sorry, the design there, was built to support this kind of write log, because nobody is trying to collect things; nobody's trying to do anything.
E
The only thing I've tried: many years ago, like 10 years ago, when these were still young, we tried to do the translation layer, tried to bypass the translation, so you'd be able to do that. So I don't know if this is a valid option: can you say, you know what, if we have the firmware...
E
Maybe you can disable the flash translation layer, and then you could do the write-ahead log, and there's not going to be any write amplification, because nobody's going to do garbage collection. The problem for us is, if we do small writes one after another, the garbage collection is going to kick in and it's going to amplify the writes, which is not what we want.
A
Just in terms of your PG log stuff that you're doing, though: that's, I think, a very valuable thing to be looking at, because at the very least it's something that might reduce the number of keys going into the database significantly, and delete some tombstones; very specifically tombstones. And the PG log is kind of nice from the standpoint that, you're right, you don't go read it during normal operation; it's only read when you have a case where you actually need to read it, after an OSD reboot.
E
We'll need to compact; we'll have more and more stuff to compact. Because naturally, if we don't recycle keys, then usually, once you've written the onode a few times, it's not going to change again; but now we're just going to change them again and again and again. Hopefully they're going to be changed in memory, but I don't know if you're going to create extra rounds of them, and in every level of the LSM you're going to have another copy of the object.
E
So maybe we could allocate enough memory for them to stay; like, I don't know, your buffer space should be big enough to hold 3,000 keys in memory, so that you can always have them all in the same level.
E
You have to divide it by 512: how many PGs... 512, I know, not all of them are going to be active at the same time. But what's the number of active PGs?
E
...the system, by just keeping writing everything again and again and again; maybe that's the best thing to do. I think I need to see how I recycle the numbers. Maybe the best thing is to always recycle the last one first, or the first one; I don't know, like whether you need to do a FIFO to see which one of them it is.
A
This is why it's very seductive to just use, like, on-disk ring buffers, right? Yeah, just get RocksDB out of the picture entirely and just have, you know, some continuous space on disk that you write into for this.
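A minimal sketch of such an on-disk ring (all names assumed, not Ceph code): a fixed region written cyclically, with a small header tracking the next write offset and how far the log has been destaged and can therefore be overwritten:

```cpp
// Hypothetical on-disk ring for PG-log-style records.
#include <cstdint>

struct RingHeader {
  uint64_t head = 0;           // byte offset of the next write in the region
  uint64_t committed_seq = 0;  // entries <= this are destaged, reclaimable
};

// Reserve space for a record of `len` bytes; returns the offset to write
// at. Wraps to offset 0 rather than splitting a record across the end.
// The caller must ensure data being overwritten is already committed.
uint64_t ring_reserve(RingHeader& h, uint64_t region_size, uint64_t len) {
  if (h.head + len > region_size)
    h.head = 0;
  uint64_t off = h.head;
  h.head += len;
  return off;
}
```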
E
But how do you get away from write amplification? So I'm again going to say this; I know I've said it a few times: zoned namespaces might be our best friend here. Just slide forward; there's no write amplification, and the write speed is good enough. I mean, they can easily do twenty thousand IOPS, like twenty thousand items per second, even if they're small ones. So it's okay to do all these small, stupid writes.
A
One thing that we have moved toward, though, is telling people that, even with BlueStore, we expect that if they care about performance, they should have some kind of flash in the system for the write-ahead log.
E
I think now they have, like, 64 megabytes of battery-protected memory for cache. So if you're using just one of them, and you keep sending it small writes, they finish immediately, and the drive can put everything in its local cache, organize it, and then send it out; and we never move back, we just move in one direction, and they can be staged.
A
So the big thing there is, right, battery-backed memory, you know, RAM, if you have it. And even down on flash drives we've seen that historically: in the older days, sometimes you would have a supercapacitor-backed cache, where you can issue a flush request to the disk, or to the drive, and it can immediately respond saying it's fine...
A
You know, on the assumption that if it loses power, it can write this stuff out anyway. But we even saw flash SSDs where they didn't have this, and, you know, a flush request could take a significant amount of time; it was very slow. So in terms of the user experience, and the kind of support requests we got, sometimes it very much would depend on whether or not the drive actually had this kind of battery-backed cache.
A
So if we're going to require it... or even right now, I guess, you know, we see a very, very big disparity between hardware that has this and hardware that doesn't.
B
How can we kind of reduce the memory needed, like, for onodes, or for BlueStore metadata in general? Igor or Adam, you probably have some input here, because you've looked at a lot of the different things in practice.
G
Well, first of all, in my fio plugin for BlueStore I have the ability to simulate PG log load, and I did some investigation on that a couple of years ago; well, it's still possible to do the same right now. But I recall I saw around 20 to 30 percent of a performance improvement when disabling PG log load.
G
If we talk specifically about BlueStore onode structures, then probably the way to go is to simplify them; that is, to reduce the amount of supported features, reduce the flexibility, and provide maybe more dedicated solutions, say, for immutable objects, or things like that. Well, a very, very general overview, but that's what I...
G
Yeah, and the same might apply to small objects; maybe plenty of other stuff. Well, again, support for deferred operations, which is redundant for flash.
G
And that's probably one of the reasons why RocksDB, or whatever KV store is used, provides that.
B
And so, I mean, even without reducing some of the flexibility, we could add more information to the store in terms of, like, the hints it's getting from clients. Like, RBD is already sending hints about its object size being, like, four megabytes, for example.
B
For S3 it's similar; it could perhaps be sending more hints of that same kind: what its expected I/O profile, or at least object profile, is.
B
Maybe we could plan for next week; I think we're going to talk a bit about that, and one of the papers, next week as well, but that could probably take less time.
A
Sure. So I did link Igor's onode diet PR in the chat window, for folks that are really interested in this; that might be a good thing to look at and look over for the future discussion. I can also link the double-caching fix PR that I had mentioned; I'll link it here too.
A
I do think that probably even more important than this is figuring out how to reduce CPU usage. If we can do both, it's, you know, an even bigger win; but I think it will be easier for us to argue for more memory than it is for us to argue for more CPU. Just a general feeling I've gotten.
A
We're supposed to be getting some; we've been supposed to be getting them for the last six or eight months. Theoretically they may show up at some point here, but COVID makes it tough. As soon as we get them, though, this is exactly the kind of thing that you're thinking about doing with them.
G
Well, I have got some experience with NVDIMMs over the last year, well, half a year, and I've been trying to implement, well, a sort of RocksDB replacement using the DIMMs.
G
But I can't say... well, I did a bunch of benchmarks, maybe not very extensive, but some of them, and I can't say... well, I have some additional thoughts on how to redesign my current implementation and things like that. But what I can tell for now is that using these DIMMs is not that straightforward in terms of achieving great performance, so right now I don't see much benefit of using the DIMMs compared to RocksDB on fast Intel devices, or Intel flash drives.
G
The DIMMs might be used with SeaStore, but right now there are pretty high requirements on the hardware which supports them. That will probably change eventually, but for now it's definitely not a common solution, and it's pretty expensive.
D
Guys, we do intend to spill this discussion over to next week, right? (Yeah, yeah. We should probably wrap it up.) No, no, I'm not pressing on that; but is there a limit to the flexibility of the targets we want to achieve? Because in the same discussion we talked about spinners, NVMes, Optane PMs; and, I mean, what is the target I should think about for next week? Because now it's too many...
D
Options. Reducing CPU usage, fine, that always works. But what is it? Do we reduce functionality, trim some unnecessary logic we have in onodes? Do we have some idea?
B
As we're discussing a lot of these things, though, a lot of them, like in terms of onode structure and format, and kind of whether we could make things more specific for certain use cases: a lot of those same ideas could apply, in general form, to SeaStore's conception of onodes too.
A
One thing I've been wondering about for a while, and we should wrap this up soon, is whether our concept of storing data structured in an onode actually makes sense, especially for Crimson; or if we should be thinking about groupings of data, and trying to apply, like, SIMD operations on multiple parts of what we think of as an onode.
A
At the same time, like, you know, batching batches of I/Os, trying to operate in ways that make the most effective use of the hardware's abilities. It seems like right now we think about things in terms of objects: creating objects and deleting objects and intermediate objects, and translating data in different ways, copying data in different ways. But maybe we're trying to fit all this into this kind of idea of how it's supposed to work, and it's not really the way that is most efficient anyway.
A
Adam, you had mentioned, you know, different things to think about regarding targeted hardware for BlueStore, and how to think about this stuff. Do you want to update the document that you have been working on with kind of what we've been talking about, and, you know, the ideas and questions, for next week?
A
Okay; maybe, if we do a new document, maybe we could create a new thing in the etherpad for next week. I've gotten very bad at actually updating discussion topics with all the relevant points, but maybe we could put together something in here that kind of summarizes all the questions and what we should go through next week.
A
So: continued onode discussion. All right, so yeah, anyone that wants to jump in: we've got that stub for next week on the etherpad, so feel free to update it and we can go through it next week.
A
Well then, this was a really good meeting; this is really exciting, the onode discussion. I think, with more people that are more familiar with BlueStore now, we maybe have a chance of actually changing some of these things, both in BlueStore and then, hopefully, you know, the maybe more exotic and more interesting changes that we can make in Crimson. So excellent job, everyone; looking forward to talking more next week.