From YouTube: CDS Infernalis (Day 2.2) -- OSD: Newstore
Description
Videos from Ceph Developer Summit: Infernalis (Day 2.2)
04 March 2015
https://wiki.ceph.com/Planning/CDS/Infernalis_(Mar_2015)
A
All right, I think we're on to the next one here. Sage, this is your NewStore discussion, the new OSD back end, so go ahead and take it away.
B
B
So there's a work-in-progress branch that's linked in the blueprint. It's sort of working right now: you can do reads and writes and attributes. Omap isn't implemented yet; I'm saving that for last. It has most of the infrastructure right now, except that when you submit a transaction, a lot of the work is done synchronously. I haven't split it off into the asynchronous completion thread stuff yet, but the basic machinery is there.
B
So anybody who's interested in this should look at it, and if you're looking at creating a new back end, this might be a good thing to look at. I know a bunch of people are looking at using the KeyValueStore and layering that on top of a key-value database. I think this is going to be simpler for a lot of workloads, and it might make for an interesting counterpoint to whatever you're building. So, just FYI.
B
Let's see, what should I talk about? The basic idea is that when a write comes in, if it is a new object or an append to an existing object, then we just open the existing file or create a new file, and write out the data or append to it. Then, in order to commit the transaction, we fsync that, and as soon as the fsync completes we commit a transaction to the key-value database that says,
B
you know, the size is now the new size, or the object now exists, and at that point the write is complete. That currently works, and it works for appends and for creates. The file that it creates has a really simple file name that's as short as possible. It's just a fresh inode, and all the mapping to get to that file happens in LevelDB, so we minimize path traversals on the write side.
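The append path described above (write the data, fsync the file, then commit the metadata to the key-value database) can be sketched roughly as follows. This is a minimal illustration, not the NewStore code: `TinyStore`, the `kv` dict standing in for LevelDB, and the numeric file names are all assumptions made for the example.

```python
import os
import tempfile


class TinyStore:
    """Toy model: object data lives in plain files, metadata in a dict
    that stands in for the key-value database (hypothetical)."""

    def __init__(self, root):
        self.root = root
        self.kv = {}        # stand-in for LevelDB/RocksDB
        self.next_fid = 0   # short numeric file names, chosen for the example

    def append(self, name, data):
        meta = self.kv.get(name)
        if meta is None:
            # New object: allocate a fresh, short file name.
            fid, size = self.next_fid, 0
            self.next_fid += 1
        else:
            fid, size = meta["fid"], meta["size"]
        path = os.path.join(self.root, str(fid))
        fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
        try:
            os.pwrite(fd, data, size)   # write at the current end of the object
            os.fsync(fd)                # data must be durable before the commit
        finally:
            os.close(fd)
        # Only after the fsync completes is the metadata committed; until
        # then readers still see the old size (or no object at all).
        self.kv[name] = {"fid": fid, "size": size + len(data)}
        return self.kv[name]["size"]

    def read(self, name):
        meta = self.kv[name]
        with open(os.path.join(self.root, str(meta["fid"])), "rb") as f:
            return f.read(meta["size"])
```

Because the name-to-file mapping lives in the key-value store, the file name itself can stay short, which is the point made about minimizing path traversals.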
B
It also caches the inode number in the key-value database, so eventually we'll actually be able to use the open-by-handle interface. Actually, that isn't finished yet and we're not doing it quite right, but the idea is that if you're on XFS, or actually any file system, you give the kernel a handle to open the file by the inode number, basically, instead of traversing the file name path, which will speed it up a little bit. So that, I think, should be pretty good.
B
The thing that is implemented now is a basic write-ahead log. There are cases where you can't just do an append: for example, if you're overwriting some bytes within the file, or if you're deleting a file, you have to commit the transaction to the key-value store and then go back and delete the file after the fact. So there's a generic write-ahead log thing, where when we commit the transaction, we also commit that there's this pending operation.
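A rough sketch of that write-ahead idea: the commit records a pending operation in the key-value store, and the file is only touched afterwards. The `WalStore` class, the `("wal", seq)` key shape, and the op dictionaries are hypothetical names for illustration; a real store would write the record in the same atomic batch as the object metadata.

```python
import os
import tempfile


class WalStore:
    """Toy write-ahead log kept in a dict standing in for the KV store."""

    def __init__(self, root):
        self.root = root
        self.kv = {}
        self.seq = 0

    def commit(self, op):
        """Commit a transaction that also records a pending operation."""
        self.seq += 1
        self.kv[("wal", self.seq)] = op
        return self.seq

    def apply_pending(self):
        """Apply pending operations in order, then drop their records.
        The operations are idempotent, so replaying after a crash is safe."""
        for key in sorted(k for k in self.kv if k[0] == "wal"):
            op = self.kv[key]
            path = os.path.join(self.root, op["file"])
            if op["type"] == "overwrite":
                fd = os.open(path, os.O_WRONLY)
                try:
                    os.pwrite(fd, op["data"], op["offset"])
                    os.fsync(fd)
                finally:
                    os.close(fd)
            elif op["type"] == "delete":
                if os.path.exists(path):
                    os.unlink(path)
            del self.kv[key]   # the record is removed only after the work is done
```

On restart, calling `apply_pending()` before serving requests would finish whatever was committed but not yet applied.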
B
C
Yeah. So we put the journaling after the write action, compared to FileStore now, and I think it may increase latency compared to FileStore, because [inaudible] the transaction now [inaudible], and it will block [inaudible], so it may increase the latency compared to before.
B
Sorry, which part? Is it the fact that the transaction is being done synchronously? Is that what you said? Yeah, so that's currently the case in the code that's posted, but that isn't the goal. Right now it's synchronously doing this transaction, blocking in queue_transactions. What it's going to do is just queue the transaction, hand it to the key-value store, and then a different thread, when it commits, will trigger the completion.
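The intended asynchronous shape (queue the transaction, return immediately, and let another thread trigger the completion once the commit lands) might look like the sketch below. `AsyncCommitter`, its `queue_transaction` method, and the dict used as the key-value store are assumptions for illustration and do not mirror the actual Ceph interfaces.

```python
import queue
import threading


class AsyncCommitter:
    """Toy model: submitters enqueue work; one thread commits and completes."""

    def __init__(self):
        self.kv = {}                      # stand-in for the KV database
        self.q = queue.Queue()
        self.thread = threading.Thread(target=self._run, daemon=True)
        self.thread.start()

    def queue_transaction(self, updates, on_commit):
        # Returns right away; nothing here waits for the commit.
        self.q.put((updates, on_commit))

    def _run(self):
        while True:
            item = self.q.get()
            if item is None:              # shutdown sentinel
                break
            updates, on_commit = item
            self.kv.update(updates)       # stand-in for the durable KV commit
            on_commit()                   # completion fires on this thread

    def shutdown(self):
        self.q.put(None)
        self.thread.join()
```

The trade-off raised later in the discussion applies here: every hand-off to another thread costs a context switch, which is why work that is not expected to block may be better done inline by the submitter.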
C
Yeah, I want to know the actual reason why we put the journaling after the actual transaction. Why don't we follow the FileStore arrangement [inaudible], where we can move some more jobs before the journal? For example, we can skip the journal if we meet some conditions. I mean, we can do some more simple jobs before the journaling, and then put the actual transactions after the journaling.
B
I'm not quite sure I follow that. Are you saying that we start with the current FileStore and just skip the journaling step in certain situations? Or do you mean that you want to queue the work asynchronously in NewStore so that the work doesn't happen synchronously, so you minimize the synchronous work? Yeah.
B
Yeah, so that's true. I think the assumption is that if you're ever submitting a write on an object, the OSD has already just stat'ed the object and read it, so it should be in the cache. That's the first assumption, so we should never block on the get-onode piece. It'll already be there.
B
The other thing I'm trying to do is put as much stuff in the synchronous queue_transactions as we can, because the problem that we're seeing on the high-performance SSD and flash stuff is that if we keep switching thread contexts all the time, it just falls apart. So my general thought is that the first implementation will just put all the things that we expect not to block in queue_transactions directly.
B
If we find that that's a problem, then we can have an option where it pushes that off, queues it up, and then something else does it. Maybe for slower disks that'll work better, but I'm guessing that if we're careful about that, it'll work okay for disks too.
D
D
B
The main example is if we're going to queue a write to the file. If we're doing buffered I/O, usually that's not going to block, and so we can just do it synchronously. If we're doing direct I/O, then we can start an AIO operation, and that also generally won't block, and so that can be done synchronously.
B
If it does block, it's because there's memory cache pressure and we need to slow down anyway, and that's the same case where the current stuff we have in FileStore, which does all the throttling, would be slowing us down. So I think we can just skip all that complexity and let the kernel back-pressure do that for us. Does that make sense?
C
C
C
D
Regarding this queue_transactions one, I can share my findings. Right now, with the KeyValueStore, I removed this thread pool stuff, so it is simply, synchronously submitting the transaction, and that is certainly giving me a performance benefit. Well, yes, actually [inaudible] performance benefits on the SSD, so we are not sure what disks will actually be bringing. So another question is: are you actually committing the transaction [inaudible], and at what stage? [inaudible]
B
So it's confusing, because the current code that's posted does it synchronously in queue_transactions; it actually commits it too. But that's not what it's actually going to do. It's going to queue the transaction and return, and then we'll have a callback or something that will wake up. I wasn't sure exactly how the key-value DB interface for that currently works or how it needs to be changed. It kind of looked... I don't know.
D
B
B
So for smaller writes, it's different. What's currently written, and what's currently described, handles the appends and the small overwrites, where we do the write-ahead logging. There's a second idea that we were discussing, and the idea there is that if you have small overwrites, instead of going through all the overhead of doing a write-ahead transaction and then, after you commit it, writing
B
in the background, we should just store the small writes in the key-value store also, and just have some bound on the number of small extents that we're storing in the key-value database. So maybe once we hit ten extents, ten random 4K I/Os that we're storing in the key-value database, we would say: okay, fine, now let's go take them all together and we'll update the file in the background, but batch them so that you would do them all at once.
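As a rough illustration of that second idea, the sketch below keeps small overwrites in a list standing in for extents stored in the key-value database, and writes them into the file in one batch once a bound is reached. The bound of ten comes from the discussion; `InlineOverwriteStore` and everything else here is a hypothetical simplification that assumes overwrites fall inside an already-allocated file.

```python
import os
import tempfile

MAX_INLINE_EXTENTS = 10  # bound mentioned in the discussion


class InlineOverwriteStore:
    """Toy model of batching small overwrites before touching the file."""

    def __init__(self, path):
        self.path = path
        self.pending = []   # (offset, data) extents; stand-in for KV entries
        self.flushes = 0

    def overwrite(self, offset, data):
        # Committing the extent to the KV store is the only immediate I/O.
        self.pending.append((offset, data))
        if len(self.pending) >= MAX_INLINE_EXTENTS:
            self.flush()

    def flush(self):
        if not self.pending:
            return
        fd = os.open(self.path, os.O_WRONLY)
        try:
            for offset, data in self.pending:   # arrival order, later writes win
                os.pwrite(fd, data, offset)
            os.fsync(fd)                        # one sync for the whole batch
        finally:
            os.close(fd)
        self.pending.clear()
        self.flushes += 1

    def read(self, offset, length):
        # Reads must overlay the not-yet-flushed extents on the file contents.
        with open(self.path, "rb") as f:
            f.seek(offset)
            buf = bytearray(f.read(length))
        for ext_off, data in self.pending:
            start = max(offset, ext_off)
            end = min(offset + len(buf), ext_off + len(data))
            if start < end:
                buf[start - offset:end - offset] = data[start - ext_off:end - ext_off]
        return bytes(buf)
```

The cost of deferring the file update is visible in `read`: until the flush happens, every read has to merge the pending extents with what is on disk.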
D
B
The thought would be that right now, all the key names for the onodes are like "o_" and then the object name. All the inline data, these small extents that we're storing directly in the key-value database, would be stored under a different prefix, so that they would not pollute the part of the key-value namespace where we want to have the right cache affinity or whatever. And so all the extents for a single object
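A small sketch of that key layout, assuming string keys in a sorted key-value store. The "o_" prefix is the one mentioned in the discussion; the "x_" prefix, the zero-padded hex offset, and the helper names are invented for the example, and a real encoding would also need to escape object names that contain the separator.

```python
ONODE_PREFIX = "o_"    # prefix described in the discussion
EXTENT_PREFIX = "x_"   # hypothetical prefix for inline small extents


def onode_key(name):
    return ONODE_PREFIX + name


def extent_key(name, offset):
    # Fixed-width hex keeps one object's extents sorted by offset.
    return f"{EXTENT_PREFIX}{name}_{offset:016x}"


def extents_for(kv, name):
    """Return one object's extent keys in offset order.
    Assumes object names do not themselves contain '_'."""
    prefix = f"{EXTENT_PREFIX}{name}_"
    return [k for k in sorted(kv) if k.startswith(prefix)]
```

With this layout a scan over the "o_" range touches only onodes, and the extents of a single object sit next to each other under the other prefix.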
B
D
Since you're doing this with buffered I/O, I think XFS will actually help you here in aggregating these small writes, because [inaudible]. So even if there is a [inaudible], the I/O [inaudible] is much less in the case of FileStore, then.
B
Right, so there are a couple of things that I'm hoping will work. One is that for a lot of the stuff that we're putting into the key-value database, we're kind of assuming a log-structured merge tree like RocksDB or LevelDB. A lot of the stuff that we're streaming out to the RocksDB log is either the onodes, which eventually will get condensed and compacted on the back end, and that's fine, or it's going to be the write-ahead
B
So the immediate write is a single I/O, essentially, for that small I/O, because the onode and the small data are right next to each other, effectively, in the key-value store's journal. Then later, when we do the second write to the object, we'll have a bunch of them that we'll do at once. So hopefully we can exploit the locality on the disk and there'll be only one
B
inode update in XFS, if there is one. If we're lucky, we'll be able to tweak XFS so that it doesn't update mtime. Apparently there's an ioctl: if you call the XFS ioctl instead of the generic open-by-handle syscall, there's a flag, apparently, that you can pass to it that skips mtime updates. So as long as the file is already allocated, we can avoid any extra XFS overhead.
B
I'm hoping that a combination of all those things will end up with a pretty decent picture for the small I/Os, where instead of having 2x write amplification you'll have 1.1, if we're doing 10 extents inline or something like that. Plus, I guess, there's also the amplification that happens in the key-value database too; you can't ignore that. But it'll be better, at least. I'm pretty sure it's going to be strictly better than what FileStore is doing.
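One way to read the 2x versus 1.1 figures is as device writes issued per client write: each small write costs one key-value log write, and every batch of N writes shares a single in-place file update. That interpretation, and the `ios_per_write` helper below, are assumptions made for illustration; the calculation ignores compaction inside the key-value database, which the speaker notes cannot be ignored.

```python
def ios_per_write(batch_size):
    """Device writes per client write when each small write is logged once
    and every `batch_size` writes share one file update (simplified model)."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return (batch_size + 1) / batch_size


# A batch of one is journal-then-write, i.e. 2x; ten inline extents gives 1.1.
print(ios_per_write(1), ios_per_write(10))
```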
B
C
But I think we had some more advantage before, because we delay the write-back. For example, in a hot [inaudible] case it may [inaudible]. We may write [inaudible] only small [inaudible], and we can write it directly [inaudible], and now we may fsync each time, and it may increase the [inaudible] on the disk.
B
So, yeah, it's true. It might be that doing a big syncfs at a large granularity is going to schedule the I/Os better than doing all these fsyncs. On the other hand, it might be that the fact that the new journal is effectively just a file, the key-value database's log, and the fact that it's not pre-allocated, means that XFS has the freedom, when it has all these small things that it's writing, to do delayed allocation or be smart.
B
It can just lay them out sequentially and combine them, so you actually have a pretty good I/O pattern on disk. I'm hoping it ends up being reasonably smart, and if XFS doesn't, I think Btrfs certainly is really good about laying out lots of I/Os sequentially as it lays them out on disk. So hopefully that will help. The other thing that we have going for us is that the...
B
B
B
I need to make omap work, and I need to change the current transaction submit so that it's not synchronous, so that it's actually doing it asynchronously and it isn't super slow. Once we do that, I think it'll be in a state where we can actually start benchmarking it with a couple of different workloads. We can focus on just giving it a bunch of sequential I/O and see if it will effectively use all the disk bandwidth, so you actually get,
B
you know, eighty or ninety percent of what the disk can do out of the drive, instead of the fifty percent or less that we do now. Then we can throw small I/Os at it, look at the I/O patterns the write-ahead logging is generating, and see if that makes sense, or whether we should do the inline data thing in the key-value database. So basically, I want to get to a state where we can...
B
E
I guess I was specifically thinking: is there any good way for us to tell, with the new work that's going on, what kind of store might make sense?
B
I think we can hypothesize what's going to work best and then just try it all. So we'll try LevelDB, we'll try RocksDB. We'll most likely find that RocksDB is better than LevelDB, I'm guessing, but I'm not sure. Maybe that's different on disks versus SSDs. We can try LMDB, which we're going to talk about, I think, in a moment; I'm curious how that works. But yeah, we'll see.
E
B
Fine. So there's a question from Prada in the IRC channel: can't we use direct I/O with O_DSYNC on the back end instead of fsync? Definitely. I think the goal is to do either one or the other, and I think we would decide based on whether we have a hint that we don't want to cache.
B
So if the don't-cache hint is set, or maybe if it's just really big, we can assume that it's sequential and we don't need to cache it as much, and then we can do direct I/O with O_DSYNC. If not, then we can do buffered I/O and explicitly call fsync. I think it doesn't matter that much, because specifying the O_DSYNC flag on the file handle effectively just means that after you do the write, it does the same thing that, I think, fdatasync does.
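The two alternatives being compared can be sketched with Python's `os` module, assuming a Linux system where `os.O_DSYNC` and `os.fdatasync` are available. The helper names are invented for the example, and direct I/O (`O_DIRECT`) is left out because it requires aligned buffers and sizes.

```python
import os
import tempfile


def write_with_dsync(path, data):
    """Open with O_DSYNC: each write returns only once the data is durable."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_DSYNC, 0o644)
    try:
        os.write(fd, data)
    finally:
        os.close(fd)


def write_then_fdatasync(path, data):
    """Buffered write followed by an explicit data sync."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        os.write(fd, data)
        os.fdatasync(fd)
    finally:
        os.close(fd)
```

Both leave the same bytes durable on disk; the practical difference is whether the sync is implied by every write or issued explicitly, which matters when several writes could share one sync.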
B
C
Yeah, I'm not sure. [inaudible] about the method for flushing, like in FileStore. In your NewStore, is the method simple, just like a database? A database [inaudible] will use a checkpoint mechanism to flush the in-flight [inaudible] to the disk.
C
C
B
Yeah, so the current thing: for larger I/Os or appends, large appends at least, we want to just avoid any write-back. It should just do it, so we avoid the double writes completely. But ignoring that, if we're talking about small I/Os, then we have this write-ahead log. Right now the basic thing is that as soon as the transaction commits, in the background it'll just go and apply the write-ahead log entry. But we could,
B
we could definitely try to batch that work up or delay it for as long as possible, so that we are aggregating multiple writes to the same object, for example, and so forth. I think that would definitely be something we could experiment with. I'm not sure if a full checkpoint-type scheme makes sense, because we're doing different things for different objects, so there's no longer a global view of a global journal and a global consistency point that you're rolling forward from.
D
B
B
Okay, so we're a little bit over time. Should we switch gears and talk about the LMDB stuff? Are you guys here? Sure, it's Sean and Chintan. Yeah.