From YouTube: Ceph Performance Meeting 2020-09-24
A
All right, I will start with the pull requests and people can chime in as we go. Let's see — not a whole lot again this week; we've got two new pull requests. One is from Adam — the other Adam — that is replacing my old PR to allow for separate RocksDB block caches on a per-column-family basis. The idea behind this is that, by allowing for multiple RocksDB block caches, we can distinguish between the block cache that services onode column families and the one that services omap column families. By doing this we can avoid double-caching onodes in both the RocksDB block cache and the BlueStore onode cache, while preserving cache for omap entries.
A
So my old PR was basically in place before we did the column family sharding and needed a refactor and rework, and since Adam had already done all the column family work, I asked him if he would mind taking a look at it, and he's re-implemented it. I think there's still a little bit of work to do there, but hopefully that will be fairly quick and we can get this in. It should result in basically much better cache behavior under the same memory envelope in the OSD.
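For illustration, here is a minimal sketch — an assumption, not the actual PR — of how separate block caches can be attached to different column families with the stock RocksDB API. The CF names and cache sizes are placeholders; the point is that the onode CF can get a deliberately small block cache, since BlueStore already caches decoded onodes itself, while the omap CF keeps a large one:

```cpp
#include <rocksdb/cache.h>
#include <rocksdb/db.h>
#include <rocksdb/table.h>
#include <vector>

int main() {
  rocksdb::Options db_opts;
  db_opts.create_if_missing = true;
  db_opts.create_missing_column_families = true;

  // Two independent LRU block caches; the sizes are illustrative only.
  auto onode_cache = rocksdb::NewLRUCache(64 << 20);   // small: BlueStore caches onodes
  auto omap_cache  = rocksdb::NewLRUCache(512 << 20);  // large: keep omap blocks hot

  rocksdb::BlockBasedTableOptions onode_tbl, omap_tbl;
  onode_tbl.block_cache = onode_cache;
  omap_tbl.block_cache  = omap_cache;

  rocksdb::ColumnFamilyOptions onode_cf, omap_cf;
  onode_cf.table_factory.reset(rocksdb::NewBlockBasedTableFactory(onode_tbl));
  omap_cf.table_factory.reset(rocksdb::NewBlockBasedTableFactory(omap_tbl));

  // Hypothetical CF names, just to show the per-CF wiring.
  std::vector<rocksdb::ColumnFamilyDescriptor> cfs = {
      {rocksdb::kDefaultColumnFamilyName, rocksdb::ColumnFamilyOptions()},
      {"onode", onode_cf},
      {"omap", omap_cf},
  };
  std::vector<rocksdb::ColumnFamilyHandle*> handles;
  rocksdb::DB* db = nullptr;
  auto s = rocksdb::DB::Open(db_opts, "/tmp/percf_cache_demo", cfs, &handles, &db);
  if (!s.ok()) return 1;
  for (auto* h : handles) delete h;
  delete db;
  return 0;
}
```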
A
Let's see — the only other new PR was one for ceph-volume: retrieve device data concurrently. It was labeled with the performance tag. I don't actually know very much about this one, but presumably retrieving data concurrently would be a good idea, so I assume that's faster.

Otherwise, two updated PRs this week. There is a PR for allowing dynamic levels in RocksDB. This is probably a good idea, but could be kind of complicated for the user to set up properly. This PR basically just makes it possible to do this: the way our code was structured before, you really couldn't use that option in RocksDB; with this PR you now have a way to actually try it. We still don't know if it actually works well, and we still don't know how a user would properly set it up, but at least this is the starting point. So I think it's a good change. It would just be really good if we could demonstrate a case where switching over to dynamic level sizing is really beneficial.
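For reference, the underlying RocksDB knob is a single option; a minimal sketch is below. How Ceph plumbs it through (for example via the bluestore_rocksdb_options config string) is the part the PR adds and is not shown here:

```cpp
#include <rocksdb/options.h>

// Sketch: enable RocksDB's dynamic level sizing. With this set, per-level
// target sizes are derived from the actual size of the last level rather
// than fixed multipliers of max_bytes_for_level_base.
rocksdb::Options dynamic_level_options() {
  rocksdb::Options opts;
  opts.level_compaction_dynamic_level_bytes = true;
  return opts;
}
```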
A
Theoretically it should be, but we should show it. The other updated PR is the D3N cache changes for RGW. I think that went through a QA test — I don't know if it actually passed or not — but Matt did a couple of additional reviews on it, so it looks like there may be some additional work that needs to be done.
A
Then we should get that in. Do you have the PR number?
A
Did you say that, with range delete — or delete range — you're seeing good behavior? Is it extra compactions that you're doing?
B
Yeah, so the idea is to perform the range delete followed by a synchronous — well, actually, the idea is to queue it, and then perform this range delete coupled with a compaction of the same range in a background thread. This compaction thread can also batch these range deletes and compactions, so instead of multiple operations it pretty often performs them in a single one. For now, from what I can see, this means that we don't have these large tombstones sitting in RocksDB for a long time — they are compacted shortly after they are issued — and so far I can see pretty nice results.
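A minimal sketch of the pattern being described, using the standard RocksDB calls (the actual patch queues and batches these in a background thread, which is not shown):

```cpp
#include <rocksdb/db.h>
#include <rocksdb/write_batch.h>

// Sketch: issue a range tombstone, then immediately compact the same range
// so the tombstone is consumed right away instead of lingering until it
// reaches the bottom level.
void delete_and_compact(rocksdb::DB* db, rocksdb::ColumnFamilyHandle* cf,
                        const rocksdb::Slice& begin, const rocksdb::Slice& end) {
  rocksdb::WriteBatch batch;
  batch.DeleteRange(cf, begin, end);        // writes a single range tombstone
  db->Write(rocksdb::WriteOptions(), &batch);

  rocksdb::CompactRangeOptions cro;
  cro.exclusive_manual_compaction = false;  // coexist with automatic compactions
  db->CompactRange(cro, cf, &begin, &end);  // consume the tombstone now
}
```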
A
Excellent. Okay — anything else, guys, that I missed? I miss stuff all the time, so feel free to chime in.
E
Yeah, sure. I mean, I hadn't read much beyond the abstract when I suggested it, so I don't know — but let's quickly summarize. Essentially, they're looking at trying to make more realistic workloads for evaluating RocksDB performance. They took traces from three different applications at Facebook that ended up using RocksDB under the hood and that did fairly different things with it — some of them used multiple column families in different ways.
E
So, as I recall, they had a few different aspects they focused on. One was key locality — having certain hot ranges of keys, which makes a lot of sense if you have certain prefixes that you're using in different ways. They also have the concept of varying intensity in terms of queries per second or operations per second, which exhibited a strongly diurnal pattern for a lot of workloads, so they ended up modeling that based on a sine-like periodic function. Essentially, they ended up comparing the workloads they generated — based on fitting the different characteristics of these workloads to more specific models for each characteristic — with a generic YCSB benchmark.
E
The workload that was closer to their traces had characteristics that much more closely matched what ended up happening at the storage layer — in terms of read amplification and write amplification — than the more randomized, less realistic YCSB workload they saw.
E
I guess there are two aspects, at least, that got to me. One was that just looking at the trace data itself was a bit interesting because of how small the keys were — I think even their very large values were 10K, which is still fairly small compared to a lot of the keys and values we have. And I think the concept of trying to very closely match a trace with a synthetic workload, using different models for different components, makes a lot of sense.
G
So what tool were we using so far to benchmark RocksDB for our code? Were we using something like...
E
Typically we're using higher-level benchmarks, rather than benchmarking RocksDB directly — something that goes through one of the high-level protocols, like RGW or RBD.
A
Yeah — I don't think anyone's ever tried to figure out how to simulate an OSD workload such that we could just run it directly against RocksDB in a standalone way.
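As a thought experiment, a first cut at such a standalone driver might look something like the sketch below. This is purely hypothetical — not Ceph code, and the key shapes, sizes, and trim depth are made-up — but it replays an OSD-like pattern of onode updates, PG-log appends, and trims straight against RocksDB:

```cpp
#include <cstdint>
#include <cstdio>
#include <string>
#include <rocksdb/db.h>
#include <rocksdb/write_batch.h>

int main() {
  rocksdb::Options opts;
  opts.create_if_missing = true;
  rocksdb::DB* db = nullptr;
  if (!rocksdb::DB::Open(opts, "/tmp/osd_sim", &db).ok()) return 1;

  const std::string onode_val(500, 'o');  // made-up sizes
  const std::string pglog_val(200, 'p');
  const uint64_t trim_lag = 3000;         // pg-log length mentioned later in the discussion

  for (uint64_t ver = 0; ver < 100000; ++ver) {
    rocksdb::WriteBatch batch;
    char okey[32], lkey[32];
    // Each "client write" updates one onode out of a hot set...
    snprintf(okey, sizeof(okey), "O_%016llx", (unsigned long long)(ver % 1024));
    batch.Put(okey, onode_val);
    // ...appends a pg-log entry...
    snprintf(lkey, sizeof(lkey), "L_%016llx", (unsigned long long)ver);
    batch.Put(lkey, pglog_val);
    // ...and trims an old pg-log entry, which creates tombstones.
    if (ver >= trim_lag) {
      char tkey[32];
      snprintf(tkey, sizeof(tkey), "L_%016llx", (unsigned long long)(ver - trim_lag));
      batch.Delete(tkey);
    }
    rocksdb::WriteOptions wo;
    wo.sync = (ver % 32 == 0);  // periodic sync points, like kv-sync commits
    db->Write(wo, &batch);
  }
  delete db;
  return 0;
}
```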
A
I kind of question how much you actually need that. In the paper they kept talking about how different the workloads they were running were from the real workloads, right — like, they were seeing much different cache hit rates. So, okay, yes, of course that means maybe you need to focus more, from a performance perspective, on how well the caching is working, or maybe you need to focus on some other area, depending on what the workload is. But as I finished this paper and read through it, I was kind of left with a "well, what have you actually accomplished?" taste in my mouth. Okay — they showed that they can figure out better ways to represent these workloads. Great. To me, that feels like the very first step in what should be a much bigger, more interesting paper. And maybe that's what this is — maybe they are going to do that — but so far, I guess it felt like 17 pages was a lot to cover what they actually did here.
A
Yes, yes, I agree. I don't want to totally discount what they did, because it is useful.
A
One thing I would have liked to see: they mentioned YCSB using a Zipfian distribution and how poorly that modeled — I don't remember which workload it was, but one of the workloads they had — it just wasn't a good representation of it. But if I remember right, YCSB lets you control the way the Zipfian distribution is laid out — basically, it lets you change its scaling factor. And I didn't see any reference in the paper — maybe I missed it; it's a long paper, or dense at least — that they actually tried adjusting the existing models to better match what their application behavior was.
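For context on that scaling factor: YCSB's Zipfian request distribution is governed by a skew constant (theta, around 0.99 by default, if I recall). The toy sampler below — not YCSB code — shows what varying that constant means for key popularity:

```cpp
#include <algorithm>
#include <cmath>
#include <random>
#include <vector>

// Toy Zipf sampler over n ranked keys. theta is the skew knob YCSB exposes:
// theta -> 0 approaches uniform; larger theta concentrates traffic on a few
// hot keys. This is the parameter the paper apparently never explored.
class ZipfSampler {
 public:
  ZipfSampler(size_t n, double theta) : cdf_(n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i) sum += 1.0 / std::pow(i + 1, theta);
    double acc = 0.0;
    for (size_t i = 0; i < n; ++i) {
      acc += 1.0 / std::pow(i + 1, theta) / sum;
      cdf_[i] = acc;  // cumulative popularity of ranks 0..i
    }
  }
  // Draw a key rank: invert the CDF with a uniform sample.
  size_t operator()(std::mt19937& rng) const {
    double u = std::uniform_real_distribution<double>(0.0, 1.0)(rng);
    return std::lower_bound(cdf_.begin(), cdf_.end(), u) - cdf_.begin();
  }
 private:
  std::vector<double> cdf_;
};
```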
E
I thought that was kind of what they were talking about in terms of fitting the models to their applications' behavior. They described trying to fit different distributions to their traces, which would imply varying the Zipfian parameters, in addition to varying the Pareto parameters or the parameters of other distributions.
G
I thought that's what they did in section 7.3, when they were comparing the benchmarking results. They said that they configured YCSB to fit the ZippyDB workload.
H
As far as I can remember, I think YCSB has something called "read latest," and you can also give it a distribution of how much you want to read based on time, and also proportions — that much I think everybody knows. So I didn't see that being used.
E
I think in their paper they had a very brief paragraph on that, but they didn't go into as much detail as I would have liked. By the end of section three they described just looking at the...
H
Yeah, I mean, I'm not super familiar with it, but I have used it in the past, and I know that they have a bunch of tunables that you can tune to model workloads — that's the reason it's been such a popular tool, in the past at least. It just seems like they decided, "okay, we're not going to experiment too much with this," and, as Mark said, "we want to just create something that works better for us."
A
Yeah, that was my understanding too. I'm not super familiar with YCSB — I've used it a couple of times — and it seemed like you had the ability to tune things a lot. So I have very much the same feeling you do. Well, I don't know — maybe they really did look into it and determined it wasn't sufficient, but it would have been nice if they had fleshed that out a lot more.
E
Yeah, I suspect there may be some pieces here that are more in depth with respect to the key-value pair ranges and sizes; they mentioned trying to contribute some of that to YCSB in the...
E
Analyzing — yeah, the tracing stuff has actually been in for the past year at least, but I don't think we've really tried it out, unless you've done some experiments, Mark.
A
Yeah, I mean, historically my view on a lot of this was just to make it as insanely fast and easy as possible to set up a Ceph cluster and run tests. But presumably you could go this way, right? You create some kind of — or maybe a variety of — different workloads, and if you can just run them directly with db_bench against RocksDB, that would potentially be even faster and more useful. Maybe.
E
Yeah, I guess I was thinking it could be useful for analysis, in addition to running a more isolated benchmark.
A
Sorry — I kind of started, honestly, tuning out a little bit by the end of the paper. Did they give kind of an overview of what all the tracing is capable of?
A
Tracing — okay, yeah, I see it here on page 212, which I guess is the same thing as page five in the PDF. Looks like that's where section three is.
A
I'm a little surprised that they said the lock being used to serialize all queries doesn't cause any performance overhead in their observations.
E
I guess another note that was important for the tracing aspect: they looked both at a large scale — like a 14-day period of time — when analyzing things, but also decided to take a trace over every single day as well, since the workloads can change quite a bit over that period. So if they have different things running on different days — I guess that's an important thing to keep in mind if we're thinking about doing tracing in the future.
A
I would think so, right — unless you start really changing things, like making really small object sizes or something. I'm guessing that a lot of RBD workloads are going to look a lot like other RBD workloads if they have the same kind of access pattern. Like, if you have reads happening, then depending on how much cache you have and how many onodes are being read from cache, you're going to have more or fewer RocksDB key lookups for those onodes — but those key lookups are always going to look kind of similar, I would think. Is that a good assumption, or am I wrong on that, do you think?
I
I guess you're wrong, Mark. I've seen RBD workloads, and they are very different — even from OSD to OSD working similarly at the same time. What's especially predominant is that from time to time there are streaks of very small writes. So basically, the rest of the time it's just large writes that keep coming slowly, and then there's a fast burst of small writes to different objects. That's what I've seen, and the patterns are really...
I
Of course, they translate into a bit more order on the RocksDB side, especially because we're often using the same objects, for example. But you couldn't infer that if I had a streak of small writes for some time, then it will continue for a long time or evolve into larger writes — it's just unpredictable. Sometimes it abruptly stops, and sometimes it turns into large writes.
A
My assumption was that if you had an RBD workload, you would always see some amount of — well, except in the perfect case — when you're doing writes, you'd be doing onode reads to check whether the onode exists, and potentially you'd always see fixed-size keys coming in, because RBD has fixed-size keys for the block layout. But with RGW you'd see very different behavior, right? Because it could be all over the place.
A
You have many different key sizes; you have potentially crazy kinds of workloads, depending on whatever the client is doing. I guess my hope was that RBD would be more consistent, but maybe that's wrong. I don't know.
E
I think, if we're trying to do more benchmarking with more realistic workloads, the idea of being able to control the model in a more fine-grained way might make sense for fio, for example — I'm not sure if it has similar kinds of options in terms of controlling locality.
A
You can really try to better model what it would look like if you were doing most of your reads out of cache — and we've done that in the past. The thing I don't like about it is that what it ends up doing is really de-emphasizing the performance impact of the storage, and mostly just focusing on how fast you're doing reads from cache, which is kind of completely uninteresting as far as actually making things better.
A
Maybe it's interesting for the customer, because if they can say, "well, 90% of my reads are coming from cache anyway," then it doesn't matter if the storage is not super fast, or slow, or whatever. But yeah, I don't know — I mean, from the perspective of giving the customer a valid...
E
Yeah, I think it sounds like you're describing, in general, when you'd want to run a benchmark with a more realistic workload versus when you'd want to do a more targeted benchmark.
A
Yeah. And Jens has actually done, as far as fio and the block layer go, a lot of the things that they talk about in this paper — being able to record traces and replay traces on block devices — and fio has some support for this kind of thing built in. fio does give you a lot of control over what kind of access patterns you want to create, and over the layout of files and things. So, I mean, yes, you have a lot of control. I don't know...
A
One second — okay, so this is the result of actually using the MLPerf tests that NVIDIA ran to showcase performance on their DGX-1 AI/ML reference architectures, and this was run in our Alias lab, which has very slow GPUs.
A
It took forever, and it was not super interesting. The gist of it is basically: you see a big spike of reads — not even that big; this is kind of pathetic, to be honest — but there's a spike at the beginning in both this GNMT test, which is an English-to-German or German-to-English translation, I think, and also the SSD test, which is an image workload. I don't remember the exact details of the algorithm — it's doing some kind of image recognition work — but it's very similar in both cases: it's mostly hitting cache, and it's super uninteresting for...
A
Exactly — but this is the benchmark that's used; this is like the industry standard that everyone doing GPUs uses for this kind of thing, and it's completely uninteresting on the storage side. It does not even remotely represent what a very large AI workload that real people are running would look like. And maybe this is where tracing helps — maybe this is what we'd use it for, right? Maybe this is a justification for why to actually do this: getting access to these data sets is really, really difficult.
E
And I guess the common problem in the past has been the difficulty of gathering traces from large-scale users of things like this — but if we can do that, we can replay them at even smaller scales.
A
To play devil's advocate a little bit, though: take this SSD test. It's basically going and performing image analysis on what used to be a large data set of images. I don't remember how big it is — maybe upwards of 50 gigabytes, something like that. It's small enough that it now fits in main memory, so you just hit page cache. But say it's 50 gigabytes of images that are, I don't know, on average between 64K and 128K in size.
A
Do we actually need to run this workload, or is it enough to say that you're just doing random reads of images into memory, and for whatever amount of memory you have, you get a certain cache hit rate, and some of these are going to be read from disk?
E
Yeah, it could be — and I think where we're actually stressing the storage, it is interesting, though, you know.
A
What I'm saying is: we can generate load — it's not that we won't be able to generate load — it's just that all this may end up being is how quickly you can load files that are moderately large, between 64K and 128K, into memory, given some amount of page cache on the client and buffer cache on the OSD.
A
Even if you had a large data set, though, right, all it might end up reducing to is: how quickly can you read files in a range of this size, given a certain cache hit rate, a certain amount of memory, and a certain likelihood of hitting the same file?
E
Yeah, I agree — I mean, I think there definitely are workloads that are simple enough that you don't need more accurate modeling to get a general idea of how they would perform, sure.
E
I think that's kind of one thing that got me into the paper: RocksDB is a fairly complex system, with the LSM tree and the different things it does to try to manage it. So that's one reason why more accurate workload modeling might be quite important there, compared with something like...
E
Yeah, yeah, exactly — that's what I'm getting at. That LSM structure is very complex, whereas a single object lookup in Ceph isn't going through that same kind of structure, and every single write isn't going through nearly as complex an I/O path.
E
They talked a lot about the sizes of the data sets and the different sizes of keys and values; I don't think they did a lot of analysis on...
A
What I was thinking is: if you end up in a situation where the database itself is being hit more strenuously, then as the database grows you end up with more time spent in compaction and more likelihood of hitting situations where you're blocking — like, a specific level is being compacted — and so your cache misses could potentially become a lot more severe.
E
We've got a few minutes left. Do you want to dive into the other discussion we started in the stand-up — with the guinea pig in there?
F
Okay. So here's what we're trying to do. At the moment, an object's onode goes to RocksDB in column family A, and we've got the PG log going to column family B. For the PG log entries, we never want them to go to disk; we just want them to be on the same write-ahead log as the object's onode, which is why we put them in RocksDB in the first place. The problem is the way we deal with them: we create log entries, but when the onode reaches the disk, we then remove them — we create a delete entry, which becomes a tombstone — and these things tend to stay forever, because usually you have to propagate them all the way to the last level before you can remove them.
F
So instead we just keep updating the value with the same key, but we never remove anything, and we hope that that way we're going to get away from the tombstones and all these problems. Because, in theory, our memtable is going to be very small — we have something like 3000 entries — and with just 3000 entries we assume that we could keep the memtable in memory and never go to disk.
F
Then I started reading the RocksDB documentation, and I realized that our design is not going to work the way it is. What happens is that a write-ahead log can never be removed until all the memtables that have entries in it have been flushed to disk. And because we never delete anything, the write-ahead log becomes bigger and bigger and bigger, and at some point RocksDB takes the memtable for the PG log and flushes it — so everything goes down to disk anyway.
F
So our solution was doing worse than before. My understanding now is that if we want to get performance, we need the write-ahead log to be deletable, so we need to find a way to allow it to be deleted while still never flushing the memtable for the PG log. The way a write-ahead log gets deleted is: every time a memtable is flushed, you need to make sure that all the memtables that have copies in that log have also been flushed, which means they've moved to a higher version. Does that make sense to everyone so far? I actually wrote a lot of this in my email today, so I think it will be easier to follow alongside that. So we need to find a way to fake the memtable flush. I was thinking maybe we could do everything that is done for the memtable flush — except actually flushing.
F
What happens in a memtable flush is that we create a new write-ahead log, we increment the memtable sequence number, and somehow the old write-ahead log is updated with our new version, so it knows it can ignore everything from us, because we've moved all the old versions — they've been flushed. Now it only has to wait for everybody else to flush and report that they've moved to the new sequence number, and then it can be discarded.
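For reference, that is the bookkeeping a normal flush triggers today; a minimal sketch of the existing API is below. This is the path the proposed "fake flush" or discard would have to emulate without the actual I/O:

```cpp
#include <rocksdb/db.h>

// Sketch of today's normal path: manually flushing the pg-log column
// family is what lets old WAL files be released. Flushing advances the
// CF's log number, and any WAL older than every CF's log number can go.
void release_wal_via_flush(rocksdb::DB* db,
                           rocksdb::ColumnFamilyHandle* pglog_cf) {
  rocksdb::FlushOptions fo;
  fo.wait = true;           // block until the memtable actually hits disk
  db->Flush(fo, pglog_cf);
}
```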
F
So I'm trying now to find the code doing the flush, and hopefully we'll be able to do these things. But it's a dirty hack, because the code was never intended for this to happen. I'm still trying to find the logic that could allow this. Maybe we could design a mode which says "memtable discard." By doing a memtable discard, we'd also need to remove the memtable, which might not be the end of the world. So instead of using a single memtable for the PG log, we could use two of them, and every time we fill one of them, we just call this discard — every time we want to trim everything there, we call this discard — and that will emulate the flush immediately, but without actually writing everything. Does that make sense? Because, I mean, the other way is...
F
If we break the PG log into two separate memtables, then we could use double buffering, and whenever we finish with one of them — and everything up to that point that we know has been flushed, meaning the onodes we opened PG log entries for up to that point have been flushed — we could call the equivalent of trim, but call it discard. I mean, there must be a way for people to say, "you know what, forget about this memtable, I don't want it anymore."
F
And if you don't want it, you could tell the write-ahead log, and you wouldn't have to write everything, so the whole thing gets freed. So I'm still trying to find out if there's an actual way to do that, because otherwise it would mean we'd have to keep creating and discarding memtables.
F
Yeah, so that's why it has to be a discard — can it just be hacked in, exactly. And if we know it works, we can even create a mode. In some filesystems, when they do a write, they have a flag to say "I'm writing to a temporary file, so keep it in memory but never put it to disk" — I need to remember the name for this; there's a name for this — so we could maybe call it something like a temp file, like a tmpfs.
F
Yeah — sorry. So, instead of using the skip list, we could use a hash table, which is going to give us faster access for reads and writes — actually, sorry, it's going to give us faster access for update, read, write, whatever — and we don't need a skip list. But that's just a secondary improvement; first we need to have this.
F
There is memcache, for example, right? Memcache doesn't need anything to be persistent. So we could have something like that — a temp memtable which it's possible to discard. And sometimes things get discarded because, you know what, we found some better place to store them, so we moved them somewhere else. I mean, what happens if your memtable has been migrated to somewhere else? You don't need to store it.
A
Yeah, I just wanted to mention very briefly that even if it's a side effect, it could actually be a really important one.
A
What's the value? The question, in my mind, is: historically, having the PG log helps you avoid the situation where you're going into actually looking at objects on disk for recovery, right?
E
Double buffering — where we potentially don't even keep as much of the PG log itself, but focus more on the dups, because those are very much time-bound.
A
I can't, probably — not in the way that you want. I've got a piecemeal understanding of certain parts of the RocksDB code, but not what I think you're hoping for. Adam, you've looked at it as well — how about you?
F
Okay, so I'm going to be the guinea pig — I'm going to try to do it my way. I think I understand the semantics, and I also read about other databases, because write-ahead log behavior is common to all key-value databases. I read about it in other databases' design documents, and they also mention similar behavior.
F
What I found was under LevelDB, some other RocksDB works, and somebody else, but I think all of them are from the same family, so that behavior makes sense. I'm still trying to find out if there is a way to do a per-column-family "discard this memtable," and if that can be done, then I don't have to do anything. If not, then I have to add this code to do a memtable discard, because the thing I suggested in the email is pretty ugly, right?
F
So the question is — and I cannot make any judgment call here — what Sam told me is that the reason they put the PG log in RocksDB is not because they couldn't use another write-ahead log; it's because they wanted it to be consistent with the object's onode. They want it all to be in complete sync. I don't know how critical that is. I didn't try to think whether there is a better way — or, sorry, another way — to do it while using a separate write-ahead log.
I
Then we aggregate writes to RocksDB, and at certain synchronization points we just tell it that the batch we've already written needs to land safely on disk — so we force it periodically. I mean often, but periodically.
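A minimal sketch of that pattern with the existing RocksDB API — aggregate into a batch, then force the WAL to disk at a sync point:

```cpp
#include <rocksdb/db.h>
#include <rocksdb/write_batch.h>

// Sketch: everything queued in the batch is committed in one transaction,
// and sync=true fsyncs the write-ahead log before returning, which is the
// periodic "land safely on disk" point described above.
void commit_sync_point(rocksdb::DB* db, rocksdb::WriteBatch* batch) {
  rocksdb::WriteOptions wo;
  wo.sync = true;
  db->Write(wo, batch);
  batch->Clear();  // start aggregating the next interval's writes
}
```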
F
We could, at the start of every iteration — somebody, the dispatcher, is poking at all the queues, and it can say, "oh, I can see there's a request here, here, and here." So it could just take the list, create the write-ahead log entries, send them off, and only then go and start processing the writes. So it could do it ahead of time, really. It's going to be a "write-ahead-ahead log" — read-ahead plus write-ahead. Yeah, so you do read-ahead into the future.
I
I don't believe we can. How would we then recover from sudden loss, restarts, events like that? I'm not sure we can do that. If we do it that way — writing all the writes ahead, as a future — how can we then synchronize properly? Right now we just blindly trust RocksDB to batch it into one transaction, and we are synchronized only because of all those mechanisms that are already there.
F
You could just build your history, right? I mean, you could read from the queue and realize we're going to have that many writes — let's put them in the write-ahead log before we even start processing them. You might have some cases in which, later, you decide, "you know what, you don't need to do this write," because of some reason — there was some mistake in the write — but it's very uncommon. So you could send a tombstone — you could just do an undo in the write-ahead log for that action. But in the vast majority of cases you don't reject writes, right? The reason to reject writes is usually that you're running out of resources, or there's something wrong with the request permissions, I don't know — but you don't expect this to happen.
F
So, again, I'm still not really sure I fully understand the semantics of the write-ahead log. Sam was pretty firm on this: he wants a consistent, system-wide view between all the column families — onode and PG log, everything needs to be consistent in the same write-ahead log. I'm not sure why you'd want it, but he wants it. And I don't understand the system well enough to say we don't need it — actually, I don't understand it at all.
F
So there might be an extremely good reason why everything has to be together, but maybe it is possible for us to batch them together — just make a plan of what we're going to do. Looking into the future, we could poke all the queues, construct descriptions of our future work, and send them to the write-ahead log, and then, in our own good time, start processing them — and then do that again from time to time: every time you reach some point, you start looking into the future and building your write-ahead logs. But I need Josh and Sam to see if there's any merit to this concept.
C
Josh, Sage and I talked about...
E
For the PG log, I think there are some existing structures that aren't specific to PGs — like the SnapMapper, I think, was one of them.
E
But I think the general idea with batching into the write-ahead log is to avoid those extra IOs, essentially, and the single write-ahead log is a really simple way to keep things consistent in a single transaction as well. It's possible to do it in other ways, with multiple write-ahead logs.
F
But the dispatchers are working per-PG, right? The dispatchers don't see all the future — every dispatcher just sees one PG. Is that correct?
A
Right now, maybe not. We know that in the past the kv-sync thread has been a source of contention — originally in BlueStore it was very slow; it was the primary bottleneck we had. We've improved it dramatically: now we can do about 70,000 IOPS in a single OSD before the kv-sync thread becomes a major problem, and we're not there yet for a real cluster.
F
Adam,
I
think
we
could
remove
the
code
during
the
pg
logic
stage.
I
mean
just
don't
send
anything
to
the
pity
log
so
that
we
could
that
way.
We
could
see
how
expensive
is
the
pg
log.
F
So we could measure the performance impact. Of course, we're going to lose recoverability, because if we don't write it, or if we wait, then the semantics of recovery are not going to work — but we don't care about that for now; we just want to see how expensive it is. Look at the gain.
A
Yeah — the goal is to give us the upper bound on what we could do if we also improved the in-memory processing of the PG log, right? Like, if we could make it much, much faster, then that's the upper bound of what we'd get.
A
Well, yeah — and the question, though, right, is: those are the numbers that we saw with the code at the time this testing was done. The next question, though, is okay: every single bottleneck that you remove, everything you make faster — sometimes it could be that we'd see better results with this if there wasn't something else in the code that was slow, right? Yes, but it could also...
E
I was asking: is it worth re-running this kind of experiment again with today's code?
A
Could
be
been
a
little
while,
since
we've
looked
at
this
stuff,
I
wonder,
though,.
F
That's just Tuesday, so that gives you some lead time to do things — and, you know what, even researching RocksDB itself is very interesting. So I don't know; I'm spending time there. Worst-case scenario, I'm just going to learn things, and maybe find a better place to use them, or other places to use them.
A
Is all of this kind of pointless, right, because Crimson is the real future for making all of this kind of thing good? This research helps us, but what I want to avoid is for us to end up doing a bunch of work studying a problem that we may never really get around to fixing. So if we're serious about really changing RocksDB — you know, really changing some of this stuff — then maybe we should do it, but it's a lot of work, I think you should know.
G
So I think it's worth evaluating how long the changes you talk about are going to take.
I
There's a class you can modify — there is a plugin system for memtables; you can insert your own. Okay — but from what I remember, you would need to actually include a lot of other stuff, because there are a lot of dependencies there. But the top level of the memtable does have an interface.
F
The version is updated on the WAL — on the write-ahead log — so they can be trimmed implicitly and we could start again. So, again, I'm hoping that this thing is doable — in theory it sounds like it could be done — but I'm really not that familiar, sorry, not familiar at all, with the RocksDB code. The theory makes sense to me, though.
E
Yeah, I think the theory makes sense, and I think, as an aside, it's worth evaluating how much effort this is and how much benefit it could provide. Some of the other things we talked about with the metadata are also pretty interesting, because they would apply both to BlueStore and to Crimson.
G
There's one thing I wanted to mention before we wrap things up, regarding papers we might discuss next: there's a systems and storage conference happening in two weeks. It was actually supposed to happen in Israel, run by IBM Research, but it's happening virtually.
G
I'm going to link it here. Registration is free, and there should be some pretty interesting talks there. So — okay, can you send it by email?
G
Yes, I'll send it — I'll send it here, and I'll also send it by email for anyone; I'll send it to you, Gabi, and the rest of the team, but it's also here in the chat if anyone's interested. Personally, I know a lot of the people organizing this event, and I looked at some of the accepted papers, so it should be pretty interesting. If you guys are interested, it's happening on the 13th through the 14th of October.
A
And then maybe, if people attend any presentations there that are particularly interesting, we can add them as discussion topics.
G
For a future meeting — yes. I plan to attend at least one of the days, so yeah. But I figured, if anyone else is interested, it seems like it's at pretty convenient times, both for people from the US and for Europe — the US, even China. So, yeah.
I
But as a last note: I thought about what Gabi was saying about skip lists. I think we do have some prefixes that we never actually iterate through — the deferred writes, maybe. We could really drop the skip-list memtable for those and just go for some hash table. I never thought about it before, but that seems feasible.
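A minimal sketch of that idea with the stock RocksDB API: hash-based memtable representations already exist and just need a prefix extractor, trading ordered iteration for cheaper inserts and point lookups. The 8-byte prefix length is an arbitrary placeholder, not a Ceph key format:

```cpp
#include <rocksdb/db.h>
#include <rocksdb/memtablerep.h>
#include <rocksdb/slice_transform.h>

// Sketch: replace the default skip-list memtable with a hash-bucketed one
// for data that is only ever point-looked-up, never range-scanned.
rocksdb::Options hash_memtable_options() {
  rocksdb::Options opts;
  // Non-skiplist memtable reps do not support concurrent memtable writes.
  opts.allow_concurrent_memtable_write = false;
  opts.prefix_extractor.reset(rocksdb::NewFixedPrefixTransform(8));
  opts.memtable_factory.reset(rocksdb::NewHashSkipListRepFactory());
  // NewHashLinkListRepFactory() is the fully hash-bucketed alternative.
  return opts;
}
```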
A
Yep — key comparisons to maintain ordering are one of the highest things I see in our wall-clock profiles around the kv-sync thread and the write-ahead log, so that could actually be a big win for us.